A Hybrid Deep Learning Architecture for Sentence Unit Detection
Duy-Cat Can∗†, Thi-Nga Ho‡ and Eng-Siong Chng†‡
∗Faculty of Information Technology, University of Engineering and Technology, VNUH, Vietnam
†Temasek Laboratories@NTU, Nanyang Technological University, Singapore
‡School of Computer Science and Engineering, Nanyang Technological University, Singapore
catcd@vnu.edu.vn, ngaht@ntu.edu.sg, ASESChng@ntu.edu.sg
Abstract—Automatic speech recognition systems currently deliver an unpunctuated sequence of words, which is hard for humans to read and degrades the performance of downstream natural language processing tasks. In this paper, we propose a hybrid approach for Sentence Unit Detection, in which the focus is on adding the full stop [.] to the unstructured text. Our model profits from the advantages of two dominant deep learning architectures: (i) the ability of a bidirectional Long Short-Term Memory to learn long-range dependencies in both directions; (ii) the ability of a Convolutional Neural Network to capture local context. We also empirically study the training objective of our networks using an extra loss and further investigate the impact of each model component on the overall result. Experiments conducted on two large-scale datasets demonstrate that the proposed architecture outperforms the previous separate models by a substantial margin of 1.82-1.91% in F1.
Availability: the source code and model are available at
https://github.com/catcd/LSTM-CNN-SUD
Keywords-Sentence Unit Detection; Punctuation; Recurrent
Neural Networks; Long Short-Term Memory; Convolutional
Neural Network;
I. INTRODUCTION
Recent years have witnessed tremendous progress in automatic speech recognition (ASR). However, the text transcript generated by current recognition systems is still simply a stream of words without punctuation or segmentation. Generally, the human readability of the transcript can be considerably improved by the presence of punctuation marks [1], and segmenting the text at punctuation positions also increases the accuracy of post-processing tasks such as question answering or machine translation. In this work, we approach the Sentence Unit Detection (SUD) problem as sequential tagging, which aims to segment the sequence of words by labeling the end of each sentence with a full stop mark. In addition, we further try to predict fine-grained punctuation marks.
In the last decade, deep learning methods have produced state-of-the-art results in many natural language processing (NLP) tasks by using multiple hidden layers to learn robust representations of data. The two most typical deep neural networks (DNNs) are the Convolutional Neural Network (CNN) [2] and the Recurrent Neural Network (RNN) [3] with Long Short-Term Memory (LSTM) units [4]. The CNN is good at capturing n-gram features in flat structures and has proven effective in NLP [5]. The RNN performs effectively on sequential data, and its many variants underlie several state-of-the-art NLP systems [6].
In this work, we present an analysis of a neural architecture that takes advantage of these two DNNs. The hybrid model benefits from the far-context capturing ability of a multilayer bidirectional LSTM and the local features extracted by the CNN.
Compared with prior work, the contributions of this paper can be summarized as follows:
• We propose a hybrid LSTM-CNN model and show that it is effective in detecting sentence units and punctuation marks on two corpora.
• We demonstrate the effectiveness of the proposed extra loss in training the model.
• We mitigate the class imbalance problem by using a weighted cross-entropy loss.
II. RELATED WORK
There has been a considerable amount of effort on constructing computational models to detect sentence units in unpunctuated text. Most recent works can be divided into two categories: methods based on hand-crafted features and methods based on automatically extracted features.
Previous research frequently used lexical features such as bag-of-words or n-gram models [7]; these techniques have been compared with ConvNets by Zhang and LeCun [8]. For training, some state-of-the-art SUD approaches chose decision trees [9] or Conditional Random Fields (CRFs) [10] as the classifier. These approaches take only traditional lexical features as input, which depend on expert knowledge and therefore typically generalize poorly given the expensive human effort involved.
In recent years, many studies have explored other possibilities by using word embeddings [11] with various DNN architectures to learn features without prior knowledge. Che et al. [11] applied a CNN model on purely lexical input, with pre-trained word vectors as the only features. Tilk and Alumäe [12] presented a two-stage RNN-based model using LSTM units, demonstrating the performance of the LSTM on the SUD task. Many enhancements have also been used to improve the LSTM, such as a CRF on the LSTM output [13, 14] or an attention mechanism [15]. Meanwhile, some recent experiments have shown the effectiveness of stacking a CNN on the output of an LSTM for relation classification or sentiment analysis [16, 17]. Inspired by these experiments, we attempt this combination in a different way for sentence unit detection in this paper.
Figure 1. An overview of the proposed model.
III. PROPOSED MODEL
Figure 1 depicts the overall architecture of our proposed model. Given unpunctuated text as input, it is passed through an embedding layer to generate input vectors for our neural network. Along the sequence of words, two recurrent neural networks with Long Short-Term Memory units are applied, one in each direction, to learn hidden representations of words in the embedded space. A convolution layer is also applied to capture local features from each word and its neighbors.
A multi-softmax layer is placed on the output of the previous phase for classification. During the training stage, the hidden states of the LSTM and the local features from the CNN are concatenated, and a fine-grained softmax layer then performs a (K + 1)-class classification. Additionally, two coarse-grained softmax classifiers on the LSTM and CNN outputs are used to perform binary classifications. During the testing stage, the final (K + 1)-class distribution is the one provided by the fine-grained classifier. The details of each layer are described below.
A. Embeddings
In the Embeddings layer, each word in the input sequence is transformed into a vector by looking up the embedding matrix $W_e \in \mathbb{R}^{d \times |V|}$, where $d$ is the dimension of a vector and $V$ is the vocabulary of all words we consider.
The embedding matrix is generated using a pre-trained word embedding model, whose word vectors capture hidden information about a language, such as word analogies or semantics, based on external context. In this paper, we use the fastText word embedding model [18], which is trained on Wikipedia data.
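To make the lookup concrete, the following is a minimal Python/NumPy sketch of building $W_e$ from pre-trained vectors and embedding an input sequence. It assumes the fastText vectors have already been loaded into a word-to-vector dictionary; the names (build_embedding_matrix, embed) are illustrative and not taken from the released code, and the matrix is stored with one row per word, i.e. the transpose of the $d \times |V|$ convention above.

```python
import numpy as np

def build_embedding_matrix(vocab, fasttext_vecs, d=300):
    # vocab: dict mapping each word to a row index; fasttext_vecs: dict
    # mapping words to d-dimensional pre-trained fastText vectors (assumed).
    W_e = np.zeros((len(vocab), d), dtype=np.float32)
    for word, idx in vocab.items():
        if word in fasttext_vecs:
            W_e[idx] = fasttext_vecs[word]   # out-of-vocabulary words stay zero
    return W_e

def embed(word_ids, W_e):
    # Transform an input sequence of word ids into a (T, d) matrix of vectors.
    return W_e[np.asarray(word_ids)]
```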
B. Feature extraction
1) Multilayer Bidirectional Long Short-Term Memory:
To take advantage of sequential data, we make use of a Recurrent Neural Network [3] with Long Short-Term Memory units [4], which have demonstrated effectiveness in capturing long-term dependencies. A common LSTM unit is composed of four components: a memory cell $c_t$, an input gate $i_t$, an output gate $o_t$, and a forget gate $f_t$. The hidden state $h_t$ is calculated from the current input $x_t$, the previous hidden state $h_{t-1}$, and the previous memory cell $c_{t-1}$ as follows:

$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)$ (1)
$g_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)$ (2)
$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)$ (3)
$c_t = i_t \circ g_t + f_t \circ c_{t-1}$ (4)
$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)$ (5)
$h_t = o_t \circ \tanh(c_t)$ (6)

where $\sigma$ denotes the sigmoid function and $\circ$ denotes the entry-wise product.
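For reference, a direct NumPy transcription of Eqs. (1)-(6) for a single time step is sketched below; the actual model stacks such cells into a multilayer bidirectional network (running over the sequence in both directions), which in practice would be provided by a deep learning framework rather than written by hand.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    # W, U, b are dicts of the per-gate parameters used in Eqs. (1)-(6).
    i_t = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])  # input gate, Eq. (1)
    g_t = np.tanh(W["c"] @ x_t + U["c"] @ h_prev + b["c"])  # candidate cell, Eq. (2)
    f_t = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])  # forget gate, Eq. (3)
    c_t = i_t * g_t + f_t * c_prev                          # memory cell, Eq. (4)
    o_t = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])  # output gate, Eq. (5)
    h_t = o_t * np.tanh(c_t)                                # hidden state, Eq. (6)
    return h_t, c_t
```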
2) Local features with Convolutional Neural Network:
To improve the performance of the LSTM model, we use a CNN [2] layer to capture the context features around each word. We use several filter region sizes for this CNN layer, which allows the model to capture wider ranges of n-grams. The local features $l_t$ for the $t$-th word in the context of its $2n$ neighbors are extracted using a convolution filter of size $d \times (2n + 1)$, i.e.,

$l_t = f(W_{conv} x_{t-n:t+n} + b_{conv})$ (7)

where $W_{conv}$ is the weight matrix of the convolution layer, $b_{conv}$ is the bias for the hidden state vector, $x_{t-n:t+n}$ is the stack of $2n + 1$ word vectors from position $(t - n)$ to $(t + n)$, and $f$ is a non-linear activation function.
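A minimal NumPy sketch of Eq. (7) is given below, with the $2n + 1$ window flattened so that $m$ filters become a single weight matrix; ReLU is used as the activation $f$, which is an assumption since the paper only states that $f$ is non-linear. Using several region sizes simply means running this with different values of $n$ and concatenating the outputs.

```python
import numpy as np

def local_features(X, W_conv, b_conv, n):
    # X: (T, d) word vectors of a sequence; W_conv: (m, (2n + 1) * d) filters;
    # b_conv: (m,) bias. Returns the (T, m) matrix of local features l_t.
    T, d = X.shape
    X_pad = np.vstack([np.zeros((n, d)), X, np.zeros((n, d))])   # zero-pad the borders
    feats = []
    for t in range(T):
        window = X_pad[t:t + 2 * n + 1].reshape(-1)              # stacked x_{t-n:t+n}
        feats.append(np.maximum(0.0, W_conv @ window + b_conv))  # ReLU as f (assumed)
    return np.stack(feats)
```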
C. Multi-softmax classifier
A fine-grained softmax classifier is used to predict a (K + 1)-class distribution $y_t$ for each word,

$y_t = \mathrm{softmax}(W_f [h_t \oplus l_t] + b_f)$ (8)

where $W_f$ is the transformation matrix and $b_f$ is the bias vector. The fine-grained classifier makes use of a representation that combines bidirectional information with local features. This (K + 1)-class distribution then becomes the final prediction in the decoding phase.
Two coarse-grained softmax classifiers are applied to $h_t$ and $l_t$ separately, with linear transformations, to give the binary distributions $y_t^h$ and $y_t^l$ respectively, i.e.,

$y_t^h = \mathrm{softmax}(W_h h_t + b_h)$ (9)
$y_t^l = \mathrm{softmax}(W_l l_t + b_l)$ (10)
where $W_h$, $W_l$ are the transformation matrices and $b_h$, $b_l$ are the bias vectors. Classifying the $t$-th word into two coarse classes (punctuated vs. non-punctuated) strengthens the model's ability to judge the fine class.

Table I
SUMMARY OF TWO BENCHMARK DATASETS

                          RT-03-04                   MGB
                     Train     Dev     Test     Train      Dev
Example              5359      359     275      284436     6433
Non-punctuated       299693    18454   12808    5722741    112190
Full stop [.]        25573     1508    1162     582004     12368
Comma [,]            -         -       -        384732     8519
Question mark [?]    -         -       -        109757     2714
Exclamation mark [!] -         -       -        71598      2061
Three dots [...]     -         -       -        30607      918
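The three classifiers of Eqs. (8)-(10) can be sketched as follows; this is an illustrative NumPy version, with parameter shapes assumed to match the (K + 1)-class and binary outputs.

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def classify(h_t, l_t, W_f, b_f, W_h, b_h, W_l, b_l):
    # Fine-grained (K + 1)-class head on the concatenation h_t ⊕ l_t, Eq. (8);
    # this distribution is the final prediction at test time.
    y_t = softmax(W_f @ np.concatenate([h_t, l_t]) + b_f)
    # Coarse-grained binary heads on h_t and l_t, Eqs. (9)-(10);
    # used only to form the extra loss terms during training.
    y_h = softmax(W_h @ h_t + b_h)
    y_l = softmax(W_l @ l_t + b_l)
    return y_t, y_h, y_l
```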
D. Objective function and learning method
The two binary softmax classifiers are used to estimate the probability that a word is punctuated (i.e., followed by a punctuation mark). The (K + 1)-class softmax classifier is used to estimate the probability of each punctuation type for that word. For a single word in a data sample, the training objective is the penalized cross-entropy of the three classifiers, given by

$L = -\sum_{i=0}^{K} u_{ti} \log y_{ti} - \sum_{i=0}^{1} v_{ti} \log y_{ti}^{h} - \sum_{i=0}^{1} v_{ti} \log y_{ti}^{l} + \lambda \|\theta\|^2$ (11)

where $u_t \in \mathbb{R}^{K+1}$ and $v_t \in \mathbb{R}^{2}$ indicate the one-hot represented ground truth, $\theta$ is the set of model parameters to be learned, and $\lambda$ is a regularization coefficient. The gradients of the model parameters $\theta$ can be efficiently computed via back-propagation through the network structure. To minimize $L$, we apply mini-batch gradient descent with the Adam optimizer [19] in our experiments.
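A per-word sketch of Eq. (11) is shown below; in the actual training loop this quantity is averaged over a mini-batch and minimized with Adam, and the small epsilon is added only for numerical safety (it is not part of Eq. (11)).

```python
import numpy as np

def word_loss(y_t, y_h, y_l, u_t, v_t, theta, lam, eps=1e-12):
    # y_t, y_h, y_l: predicted distributions from the three classifiers;
    # u_t: one-hot (K + 1)-vector, v_t: one-hot 2-vector of the ground truth;
    # theta: flat vector of all model parameters; lam: L2 coefficient λ.
    loss = -(u_t * np.log(y_t + eps)).sum()      # fine-grained cross-entropy
    loss -= (v_t * np.log(y_h + eps)).sum()      # coarse-grained LSTM head
    loss -= (v_t * np.log(y_l + eps)).sum()      # coarse-grained CNN head
    return loss + lam * float(theta @ theta)     # L2 penalty λ‖θ‖²
```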
IV. EXPERIMENTAL EVALUATION
A. Datasets
We evaluate our LSTM-CNN model on two benchmark datasets: RT-03-04¹ and a subset of the MGB Challenge data [20]. The details of the two datasets are shown in Table I.
The RT-03-04 corpus consists of transcripts and annotations of 40 hours of English Broadcast News (BN) and Conversational Telephone Speech (CTS) audio data. For this dataset, we predict a full stop [.] label for each word that ends a full sentence. The model is fine-tuned on the training set with validation on the development set, and the results are reported on the test set, which is kept unseen by the model.
Our subset of the MGB data includes approximately 1,340 of the 1,600 hours of broadcast audio taken from four BBC TV channels over seven weeks. Since the provided dataset does not contain punctuation marks, we use the original and preprocessed subtitles to obtain the correct boundaries for each unit. In these experiments, we predict a fine-class label for each word, including full stop [.], comma [,], question mark [?], exclamation mark [!], and three dots [...]. For MGB, we separate ten percent of the training set for validation and report the results on the development set.
¹ MDE Training Data Speech, LDC2004S08 and LDC2005S16.
Table II
RESULTS ON RT-03-04 DATASET

Model        P       R       F1
CNN          82.28   52.45   64.06 (±1.54)
LSTM         81.92   66.70   73.53 (±1.02)
LSTM-CNN     79.56   70.36   74.68 (±0.36)
LSTM-CNN+    81.47   70.24   75.44 (±0.27)
Table III
RESULTS ON MGB DATASET

Model        P       R       F1 (micro)       F1 (macro)
CNN          67.88   35.62   46.72 (±0.47)    36.39 (±0.61)
LSTM         65.62   58.73   61.98 (±0.09)    45.19 (±0.61)
LSTM-CNN     65.20   60.70   62.87 (±0.11)    48.09 (±0.25)
LSTM-CNN+    67.65   60.60   63.80 (±0.04)    49.49 (±0.36)
Table IV
F1 OF EACH LABEL ON MGB DATASET

Model        Full stop [.]   Comma [,]   Question mark [?]   Exclamation mark [!]   Three dots [...]
CNN          59.91           38.52       34.99               29.04                  19.50
LSTM         74.89           56.86       71.21               19.66                  3.35
LSTM-CNN     74.45           57.96       69.78               27.99                  10.25
LSTM-CNN+    74.91           59.21       70.87               29.49                  12.96
B. Performance of the LSTM-CNN model
We conduct the training and testing process 20 times and report the averaged results. For evaluation, the predicted labels are compared to the gold annotations using the standard precision (P), recall (R), and F1 score metrics.
Tables II and III show the performance of the proposed model and its variants on the two benchmark datasets. On both RT-03-04 and MGB, the hybrid LSTM-CNN model outperforms the single models that use a CNN or an LSTM only.
On RT-03-04, the local features extracted by the CNN help to increase the recall of the LSTM model by 3.66%, and the F1 is increased by 1.15%. The result on the MGB dataset is similar: the recall and the micro-average F1 increase by 1.97% and 0.89% respectively. The standard deviations over 20 runs are 0.36 on RT-03-04 and 0.11 on MGB (micro-average).
In addition, applying the extra loss from the two coarse-grained softmax classifiers to the final training objective boosts F1 by a further 0.76% and 0.93% on the two datasets respectively. Our LSTM-CNN+ model is more stable and outperforms the other models by a large margin.
Table IV compares the results for each class on the MGB dataset. The LSTM model, despite its long-dependency information, fails to predict minor classes such as three dots in the test data. In contrast, the CNN model, with its local features, is able to capture features that resolve this problem. The better performance of the proposed combined model in detecting minor classes is therefore both notable and understandable.
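For clarity, the micro-averaged F1 in Table III pools the counts of all punctuation labels, while the per-label scores in Table IV average to the last column of Table III (our reading of that column as a macro-average). The sketch below shows both computations and is not the authors' evaluation script; pred and gold are arrays of label ids and punct_labels excludes the "Non-punctuated" class.

```python
import numpy as np

def label_f1(pred, gold, label):
    # F1 of one punctuation label, computed from its precision and recall.
    tp = np.sum((pred == label) & (gold == label))
    p = tp / max(np.sum(pred == label), 1)
    r = tp / max(np.sum(gold == label), 1)
    return 2 * p * r / max(p + r, 1e-12)

def micro_macro_f1(pred, gold, punct_labels):
    # Micro-average: pool counts over all punctuation labels.
    tp = sum(np.sum((pred == l) & (gold == l)) for l in punct_labels)
    pred_pos = sum(np.sum(pred == l) for l in punct_labels)
    gold_pos = sum(np.sum(gold == l) for l in punct_labels)
    p, r = tp / max(pred_pos, 1), tp / max(gold_pos, 1)
    micro = 2 * p * r / max(p + r, 1e-12)
    # Macro-average: mean of the per-label F1 scores (as in Table IV).
    macro = np.mean([label_f1(pred, gold, l) for l in punct_labels])
    return micro, macro
```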
C. Handling data imbalance
Since we find that all models achieve much higher precision than recall, we make one additional adjustment to our better performers on RT-03-04: we reduce the contribution of the "Non-punctuated" class to the objective function. With this adjustment, more borderline cases are classified as "Punctuated", which balances precision and recall.

Figure 2. Impact of the weighted loss on the RT-03-04 dataset using the LSTM-CNN+ model.

Figure 2 shows that as the weight of the "Non-punctuated" class increases, precision increases while recall decreases. The best result achieved is 76.68% F1 (P = 75.57%, R = 77.82%) with a ratio of 0.35:0.65 between the two classes.
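One way to realize this weighting is sketched below: each word's cross-entropy term is scaled by the weight of its true class. The weight vector is an assumption; the 0.35:0.65 setting mirrors the best ratio reported in Figure 2, with the smaller weight presumably assigned to the "Non-punctuated" class.

```python
import numpy as np

def weighted_word_loss(y_t, u_t, class_weights, eps=1e-12):
    # y_t: predicted (K + 1)-class distribution; u_t: one-hot ground truth.
    # class_weights: per-class weights, e.g. 0.35 for "Non-punctuated" and
    # 0.65 for the punctuated classes (assumed reading of the 0.35:0.65 ratio).
    w = class_weights[int(np.argmax(u_t))]            # weight of the true class
    return -w * float((u_t * np.log(y_t + eps)).sum())
```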
V. CONCLUSION
In this paper, we have presented a novel sentence unit detection model that combines two dominant deep learning networks. The proposed model takes advantage of the LSTM's ability to capture long-range dependencies on multiple time scales and the CNN's ability to learn local features in the context of neighboring words.
Experiments on two datasets showed improvements of the LSTM-CNN model for all punctuation types compared to the traditional LSTM model. The overall F1 scores were improved by 1.91% and 1.82% on the two datasets respectively, and the standard deviations over 20 runs were reduced significantly. The largest improvements were achieved when adding the extra loss in the training phase.
Several experiments were conducted to verify the rationality and effectiveness of the model's components and the proposed materials. The results also demonstrated the robustness of our model, which can automatically adapt to different types of data, from telephone speech to broadcast news, with different label schemata. In addition, the proposed model scales well to both small (40 hours) and large (1,340 hours) corpora.
The experiments also highlighted the limitation of our model with respect to the data imbalance problem. We aim to address this problem, as well as further extensions of our model, in future work. Future research includes the use of a richer set of prosodic input representations and training a new English word embedding model on a corpus of spoken-style text.
ACKNOWLEDGMENT
This research is supported by the National Research Foundation Singapore under its AI Singapore Programme [Award No.: AISG-100E-2018-006]. We also thank the anonymous reviewers for their comments and suggestions.
REFERENCES
[1] D. A. Jones, F. Wolf, E. Gibson, E. Williams, E. Fedorenko, D. A. Reynolds, and M. Zissman, "Measuring the readability of automatic speech-to-text transcripts," in Eighth European Conference on Speech Communication and Technology, 2003.
[2] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[3] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning representations by back-propagating errors," Nature, vol. 323, no. 6088, p. 533, 1986.
[4] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[5] Y. Kim, "Convolutional neural networks for sentence classification," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, 2014, pp. 1746–1751.
[6] Q. Qian, M. Huang, J. Lei, and X. Zhu, "Linguistically regularized LSTM for sentiment classification," in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, vol. 1, 2017, pp. 1679–1689.
[7] N. Ueffing, M. Bisani, and P. Vozila, "Improved models for automatic punctuation prediction for spoken and written text," in INTERSPEECH, 2013, pp. 3097–3101.
[8] X. Zhang and Y. LeCun, "Text understanding from scratch," arXiv preprint arXiv:1502.01710, 2015.
[9] A. Stolcke et al., "Automatic detection of sentence boundaries and disfluencies based on recognized words," in Fifth International Conference on Spoken Language Processing, 1998.
[10] X. Wang, H. T. Ng, and K. C. Sim, "Dynamic conditional random fields for joint sentence boundary and punctuation prediction," in INTERSPEECH, 2012.
[11] X. Che, C. Wang, H. Yang, and C. Meinel, "Punctuation prediction for unsegmented transcript based on word vector," in LREC, 2016.
[12] O. Tilk and T. Alumäe, "LSTM for punctuation restoration in speech transcripts," in INTERSPEECH, 2015.
[13] C. Xu, L. Xie, G. Huang, X. Xiao, E. S. Chng, and H. Li, "A deep neural network approach for sentence boundary detection in broadcast news," in INTERSPEECH, 2014.
[14] K. Xu, L. Xie, and K. Yao, "Investigating LSTM for punctuation prediction," in International Symposium on Chinese Spoken Language Processing. IEEE, 2016, pp. 1–5.
[15] O. Tilk and T. Alumäe, "Bidirectional recurrent neural network with attention mechanism for punctuation restoration," in INTERSPEECH, 2016, pp. 3047–3051.
[16] H. Q. Le, D. C. Can, S. T. Vu, T. H. Dang, M. T. Pilehvar, and N. Collier, "Large-scale exploration of neural relation classification architectures," in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018, pp. 2266–2277.
[17] N. Kalchbrenner, E. Grefenstette, and P. Blunsom, "A convolutional neural network for modelling sentences," in Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, vol. 1, 2014, pp. 655–665.
[18] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, "Enriching word vectors with subword information," Transactions of the Association for Computational Linguistics, vol. 5, pp. 135–146, 2017.
[19] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," CoRR, 2014. [Online]. Available: http://arxiv.org/abs/1412.6980
[20] P. Bell et al., "The MGB challenge: Evaluating multi-genre broadcast media recognition," in IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU). IEEE, 2015, pp. 687–693.