A Hybrid Deep Learning Architecture for Sentence Unit Detection
Duy-Cat Can∗†, Thi-Nga Ho‡ and Eng-Siong Chng†‡
∗Faculty of Information Technology, University of Engineering and Technology, VNUH, Vietnam
†Temasek Laboratories@NTU, Nanyang Technological University, Singapore
‡School of Computer Science and Engineering, Nanyang Technological University, Singapore
catcd@vnu.edu.vn, ngaht@ntu.edu.sg, ASESChng@ntu.edu.sg
Abstract—Automatic speech recognition systems currently deliver an unpunctuated sequence of words, which is hard for humans to read and degrades the performance of downstream natural language processing tasks. In this paper, we propose a hybrid approach for Sentence Unit Detection, in which the focus is on adding the full stop [.] to the unstructured text. Our model profits from the advantages of two dominant deep learning architectures: (i) the ability of a bidirectional Long Short-Term Memory to learn long-range dependencies in both directions; (ii) the ability of a Convolutional Neural Network to capture local context. We also empirically study the training objective of our networks using an extra loss and further investigate the impact of each model component on the overall result. Experiments conducted on two large-scale datasets demonstrate that the proposed architecture outperforms the previous separate models by a substantial margin of 1.82-1.91% in F1.
Availability: the source code and model are available at
https://github.com/catcd/LSTM-CNN-SUD
Keywords-Sentence Unit Detection; Punctuation; Recurrent
Neural Networks; Long Short-Term Memory; Convolutional
Neural Network;
I. INTRODUCTION
Recent years have witnessed tremendous progress in automatic speech recognition (ASR). However, the text transcript generated by current recognition systems is still simply a stream of words without punctuation or segmentation. Generally, the human readability of the transcript can be considerably improved by the presence of punctuation marks [1], and segmenting the text at punctuation positions also increases the accuracy of post-processing tasks such as question answering or machine translation. In this work, we approach the Sentence Unit Detection (SUD) problem as sequential tagging, which aims to segment the sequence of words by labeling the end of each sentence with a full stop mark. In addition, we further try to predict fine-grained punctuation marks.
In the last decade, deep learning methods have produced state-of-the-art results in many natural language processing (NLP) tasks by using multiple hidden layers to learn robust representations of data. The two most typical deep neural networks (DNNs) are the Convolutional Neural Network (CNN) [2] and the Recurrent Neural Network (RNN) [3] with Long Short-Term Memory (LSTM) units [4]. The CNN is good at capturing n-gram features in flat structures and has proven effective in NLP [5]. The RNN performs effectively on sequential data, and its many variants underlie several state-of-the-art NLP systems [6].
In this work, we present an analysis of a neural architecture that takes advantage of these two DNNs. The hybrid model benefits from the far-context capturing ability of a multilayer bidirectional LSTM and the local features extracted by the CNN.
Compared with prior work, the contributions of this paper can be summarized as follows:
• We propose a hybrid LSTM-CNN model and show that it is effective in detecting sentence units and punctuation marks on two corpora.
• We demonstrate the effectiveness of the proposed extra loss in training the model.
• We mitigate the class imbalance problem by using a weighted cross-entropy loss.
II. RELATED WORK
There has been a considerable amount of effort on constructing computational models to detect sentence units in unpunctuated text. Most recent works can be divided into two categories: methods based on hand-crafted features and methods based on automatically extracted features.
Previous research frequently used lexical features such as bag-of-words or n-gram models [7]; these techniques have been compared with ConvNets by Zhang and LeCun [8]. For training, some state-of-the-art SUD approaches chose decision trees [9] or Conditional Random Fields (CRFs) [10] as the classifier. These approaches take only traditional lexical features as input, which depend on expert knowledge and therefore typically generalize poorly given the expensive human effort involved.
In recent years, many studies have explored other possibilities by using word embeddings [11] with various DNN architectures to learn features without prior knowledge. Che et al. [11] applied a CNN model on purely lexical input, with pre-trained word vectors as the only features. Tilk and Alumäe [12] presented a two-stage RNN-based model using LSTM units, demonstrating the performance of the LSTM on the SUD task. Many enhancements have also been used to improve the LSTM, such as a CRF on the LSTM output [13, 14] or an attention mechanism [15]. Meanwhile, some recent experiments have shown the effectiveness of stacking a CNN on the output of an LSTM for relation classification or sentiment analysis [16, 17]. Inspired by these experiments, we attempt this combination in a different way for sentence unit detection in this paper.
Figure 1. An overview of the proposed model.
III. PROPOSED MODEL
Figure 1 depicts the overall architecture of our proposed model. Given unpunctuated text as input, it is passed through an embedding layer to generate input vectors for our neural network. Along the sequence of words, two recurrent neural networks with Long Short-Term Memory units are applied, one in each direction, to learn hidden representations of words in the embedded space. A convolution layer is also applied to capture local features from each word and its neighbors.
A multi-softmax layer is placed on the output of the previous phase for classification. During the training stage, the hidden states of the LSTM and the local features from the CNN are concatenated, and a fine-grained softmax layer then performs a (K + 1)-class classification. Additionally, two coarse-grained softmax classifiers on the LSTM and CNN outputs are used to perform binary classifications. During the testing stage, the final (K + 1)-class distribution is the one provided by the fine-grained classifier. The details of each layer are described below.
A. Embeddings
In the Embeddings layer, each word in the input sequence is transformed into a vector by looking up the embedding matrix $W_e \in \mathbb{R}^{d \times |V|}$, where $d$ is the dimension of a vector and $V$ is the vocabulary of all words we consider.
The embedding matrix is generated using a pre-trained word embedding model, whose word vectors capture hidden information about a language, such as word analogies or semantics, based on external context. In this paper, we use the fastText word embedding model [18], which is trained on Wikipedia data.
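To make the lookup concrete, the following is a minimal Python/NumPy sketch of building $W_e$ from pre-trained vectors and embedding an input sequence. It assumes the fastText vectors have already been loaded into a word-to-vector dictionary; the names (build_embedding_matrix, embed) are illustrative and not taken from the released code, and the matrix is stored with one row per word, i.e. the transpose of the $d \times |V|$ convention above.

```python
import numpy as np

def build_embedding_matrix(vocab, fasttext_vecs, d=300):
    # vocab: dict mapping each word to a row index; fasttext_vecs: dict
    # mapping words to d-dimensional pre-trained fastText vectors (assumed).
    W_e = np.zeros((len(vocab), d), dtype=np.float32)
    for word, idx in vocab.items():
        if word in fasttext_vecs:
            W_e[idx] = fasttext_vecs[word]   # out-of-vocabulary words stay zero
    return W_e

def embed(word_ids, W_e):
    # Transform an input sequence of word ids into a (T, d) matrix of vectors.
    return W_e[np.asarray(word_ids)]
```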
B. Feature extraction
1) Multilayer Bidirectional Long Short-Term Memory:
To take advantage of sequential data, we make use of a Recurrent Neural Network [3] with Long Short-Term Memory units [4], which have demonstrated effectiveness in capturing long-term dependencies. A common LSTM unit is composed of four components: a memory cell $c_t$, an input gate $i_t$, an output gate $o_t$, and a forget gate $f_t$. The hidden state $h_t$ is calculated from the current input $x_t$, the previous hidden state $h_{t-1}$, and the previous memory cell $c_{t-1}$ as follows:

$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)$ (1)
$g_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)$ (2)
$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)$ (3)
$c_t = i_t \circ g_t + f_t \circ c_{t-1}$ (4)
$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)$ (5)
$h_t = o_t \circ \tanh(c_t)$ (6)

where $\sigma$ denotes the sigmoid function and $\circ$ denotes the entry-wise product.
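For reference, a direct NumPy transcription of Eqs. (1)-(6) for a single time step is sketched below; the actual model stacks such cells into a multilayer bidirectional network (running over the sequence in both directions), which in practice would be provided by a deep learning framework rather than written by hand.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    # W, U, b are dicts of the per-gate parameters used in Eqs. (1)-(6).
    i_t = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])  # input gate, Eq. (1)
    g_t = np.tanh(W["c"] @ x_t + U["c"] @ h_prev + b["c"])  # candidate cell, Eq. (2)
    f_t = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])  # forget gate, Eq. (3)
    c_t = i_t * g_t + f_t * c_prev                          # memory cell, Eq. (4)
    o_t = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])  # output gate, Eq. (5)
    h_t = o_t * np.tanh(c_t)                                # hidden state, Eq. (6)
    return h_t, c_t
```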
2) Local features with Convolutional Neural Network:
To improve the performance of the LSTM model, we use a CNN [2] layer to capture the context features around each word. We use several filter region sizes for this CNN layer, which allows the model to capture wider ranges of n-grams. The local features $l_t$ for the $t$-th word in the context of its $2n$ neighbors are extracted using a convolution filter of size $d \times (2n + 1)$, i.e.,

$l_t = f(W_{conv} x_{t-n:t+n} + b_{conv})$ (7)

where $W_{conv}$ is the weight matrix of the convolution layer, $b_{conv}$ is the bias for the hidden state vector, $x_{t-n:t+n}$ is the stack of $2n + 1$ word vectors from position $(t - n)$ to $(t + n)$, and $f$ is a non-linear activation function.
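A minimal NumPy sketch of Eq. (7) is given below, with the $2n + 1$ window flattened so that $m$ filters become a single weight matrix; ReLU is used as the activation $f$, which is an assumption since the paper only states that $f$ is non-linear. Using several region sizes simply means running this with different values of $n$ and concatenating the outputs.

```python
import numpy as np

def local_features(X, W_conv, b_conv, n):
    # X: (T, d) word vectors of a sequence; W_conv: (m, (2n + 1) * d) filters;
    # b_conv: (m,) bias. Returns the (T, m) matrix of local features l_t.
    T, d = X.shape
    X_pad = np.vstack([np.zeros((n, d)), X, np.zeros((n, d))])   # zero-pad the borders
    feats = []
    for t in range(T):
        window = X_pad[t:t + 2 * n + 1].reshape(-1)              # stacked x_{t-n:t+n}
        feats.append(np.maximum(0.0, W_conv @ window + b_conv))  # ReLU as f (assumed)
    return np.stack(feats)
```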
C. Multi-softmax classifier
A fine-grained softmax classifier is used to predict a (K + 1)-class distribution $y_t$ for each word,

$y_t = \mathrm{softmax}(W_f [h_t \oplus l_t] + b_f)$ (8)

where $W_f$ is the transformation matrix and $b_f$ is the bias vector. The fine-grained classifier makes use of a representation that combines bidirectional information with local features. This (K + 1)-class distribution then becomes the final prediction in the decoding phase.
Two coarse-grained softmax classifiers are applied to $h_t$ and $l_t$ separately, with linear transformations, to give the binary distributions $y_t^h$ and $y_t^l$ respectively, i.e.,

$y_t^h = \mathrm{softmax}(W_h h_t + b_h)$ (9)
$y_t^l = \mathrm{softmax}(W_l l_t + b_l)$ (10)
where $W_h$, $W_l$ are the transformation matrices and $b_h$, $b_l$ are the bias vectors. Classifying the $t$-th word into two coarse classes (punctuated vs. non-punctuated) strengthens the model's ability to judge the fine class.

Table I
SUMMARY OF TWO BENCHMARK DATASETS

                          RT-03-04                   MGB
                     Train     Dev     Test     Train      Dev
Example              5359      359     275      284436     6433
Non-punctuated       299693    18454   12808    5722741    112190
Full stop [.]        25573     1508    1162     582004     12368
Comma [,]            -         -       -        384732     8519
Question mark [?]    -         -       -        109757     2714
Exclamation mark [!] -         -       -        71598      2061
Three dots [...]     -         -       -        30607      918
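The three classifiers of Eqs. (8)-(10) can be sketched as follows; this is an illustrative NumPy version, with parameter shapes assumed to match the (K + 1)-class and binary outputs.

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def classify(h_t, l_t, W_f, b_f, W_h, b_h, W_l, b_l):
    # Fine-grained (K + 1)-class head on the concatenation h_t ⊕ l_t, Eq. (8);
    # this distribution is the final prediction at test time.
    y_t = softmax(W_f @ np.concatenate([h_t, l_t]) + b_f)
    # Coarse-grained binary heads on h_t and l_t, Eqs. (9)-(10);
    # used only to form the extra loss terms during training.
    y_h = softmax(W_h @ h_t + b_h)
    y_l = softmax(W_l @ l_t + b_l)
    return y_t, y_h, y_l
```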
D. Objective function and learning method
The two binary softmax classifiers are used to estimate the probability that a word is punctuated (i.e., followed by a punctuation mark). The (K + 1)-class softmax classifier is used to estimate the probability of each punctuation type for that word. For a single word in a data sample, the training objective is the penalized cross-entropy of the three classifiers, given by

$L = -\sum_{i=0}^{K} u_{ti} \log y_{ti} - \sum_{i=0}^{1} v_{ti} \log y_{ti}^{h} - \sum_{i=0}^{1} v_{ti} \log y_{ti}^{l} + \lambda \|\theta\|^2$ (11)

where $u_t \in \mathbb{R}^{K+1}$ and $v_t \in \mathbb{R}^{2}$ indicate the one-hot represented ground truth, $\theta$ is the set of model parameters to be learned, and $\lambda$ is a regularization coefficient. The gradients of the model parameters $\theta$ can be efficiently computed via back-propagation through the network structure. To minimize $L$, we apply mini-batch gradient descent with the Adam optimizer [19] in our experiments.
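A per-word sketch of Eq. (11) is shown below; in the actual training loop this quantity is averaged over a mini-batch and minimized with Adam, and the small epsilon is added only for numerical safety (it is not part of Eq. (11)).

```python
import numpy as np

def word_loss(y_t, y_h, y_l, u_t, v_t, theta, lam, eps=1e-12):
    # y_t, y_h, y_l: predicted distributions from the three classifiers;
    # u_t: one-hot (K + 1)-vector, v_t: one-hot 2-vector of the ground truth;
    # theta: flat vector of all model parameters; lam: L2 coefficient λ.
    loss = -(u_t * np.log(y_t + eps)).sum()      # fine-grained cross-entropy
    loss -= (v_t * np.log(y_h + eps)).sum()      # coarse-grained LSTM head
    loss -= (v_t * np.log(y_l + eps)).sum()      # coarse-grained CNN head
    return loss + lam * float(theta @ theta)     # L2 penalty λ‖θ‖²
```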
IV. EXPERIMENTAL EVALUATION
A. Datasets
We evaluate our LSTM-CNN model on two benchmark datasets: RT-03-04¹ and a subset of the MGB Challenge data [20]. The details of the two datasets are shown in Table I.
The RT-03-04 corpus consists of transcripts and annotations of 40 hours of English Broadcast News (BN) and Conversational Telephone Speech (CTS) audio data. For this dataset, we predict a full stop [.] label for each word that ends a full sentence. The model is fine-tuned on the training set with validation on the development set, and the results are reported on the test set, which is kept unseen by the model.
Our subset of the MGB data includes approximately 1,340 of the 1,600 hours of broadcast audio taken from four BBC TV channels over seven weeks. Since the provided dataset does not contain punctuation marks, we use the original and preprocessed subtitles to obtain the correct boundaries for each unit. In these experiments, we predict a fine-class label for each word, including full stop [.], comma [,], question mark [?], exclamation mark [!], and three dots [...]. For MGB, we separate ten percent of the training set for validation and report the results on the development set.
¹ MDE Training Data Speech, LDC2004S08 and LDC2005S16.
Table II
RESULTS ON RT-03-04 DATASET

Model        P       R       F1
CNN          82.28   52.45   64.06 (±1.54)
LSTM         81.92   66.70   73.53 (±1.02)
LSTM-CNN     79.56   70.36   74.68 (±0.36)
LSTM-CNN+    81.47   70.24   75.44 (±0.27)
Table III
RESULTS ON MGB DATASET

Model        P       R       F1 (micro)       F1 (macro)
CNN          67.88   35.62   46.72 (±0.47)    36.39 (±0.61)
LSTM         65.62   58.73   61.98 (±0.09)    45.19 (±0.61)
LSTM-CNN     65.20   60.70   62.87 (±0.11)    48.09 (±0.25)
LSTM-CNN+    67.65   60.60   63.80 (±0.04)    49.49 (±0.36)
Table IV
F1 OF EACH LABEL ON MGB DATASET

Model        Full stop [.]   Comma [,]   Question mark [?]   Exclamation mark [!]   Three dots [...]
CNN          59.91           38.52       34.99               29.04                  19.50
LSTM         74.89           56.86       71.21               19.66                  3.35
LSTM-CNN     74.45           57.96       69.78               27.99                  10.25
LSTM-CNN+    74.91           59.21       70.87               29.49                  12.96
B. Performance of the LSTM-CNN model
We conduct the training and testing process 20 times and report the averaged results. For evaluation, the predicted labels are compared to the gold annotations using the standard precision (P), recall (R), and F1 score metrics.
Tables II and III show the performance of the proposed model and its variants on the two benchmark datasets. On both RT-03-04 and MGB, the hybrid LSTM-CNN model outperforms the single models that use a CNN or an LSTM only.
On RT-03-04, the local features extracted by the CNN help to increase the recall of the LSTM model by 3.66%, and the F1 is increased by 1.15%. The result on the MGB dataset is similar: the recall and the micro-average F1 increase by 1.97% and 0.89% respectively. The standard deviations over 20 runs are 0.36 on RT-03-04 and 0.11 on MGB (micro-average).
In addition, applying the extra loss from the two coarse-grained softmax classifiers to the final training objective boosts F1 by a further 0.76% and 0.93% on the two datasets respectively. Our LSTM-CNN+ model is more stable and outperforms the other models by a large margin.
Table IV compares the results for each class on the MGB dataset. The LSTM model, despite its long-dependency information, fails to predict minor classes such as three dots in the test data. In contrast, the CNN model, with its local features, is able to capture features that resolve this problem. The better performance of the proposed combined model in detecting minor classes is therefore both notable and understandable.
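For clarity, the micro-averaged F1 in Table III pools the counts of all punctuation labels, while the per-label scores in Table IV average to the last column of Table III (our reading of that column as a macro-average). The sketch below shows both computations and is not the authors' evaluation script; pred and gold are arrays of label ids and punct_labels excludes the "Non-punctuated" class.

```python
import numpy as np

def label_f1(pred, gold, label):
    # F1 of one punctuation label, computed from its precision and recall.
    tp = np.sum((pred == label) & (gold == label))
    p = tp / max(np.sum(pred == label), 1)
    r = tp / max(np.sum(gold == label), 1)
    return 2 * p * r / max(p + r, 1e-12)

def micro_macro_f1(pred, gold, punct_labels):
    # Micro-average: pool counts over all punctuation labels.
    tp = sum(np.sum((pred == l) & (gold == l)) for l in punct_labels)
    pred_pos = sum(np.sum(pred == l) for l in punct_labels)
    gold_pos = sum(np.sum(gold == l) for l in punct_labels)
    p, r = tp / max(pred_pos, 1), tp / max(gold_pos, 1)
    micro = 2 * p * r / max(p + r, 1e-12)
    # Macro-average: mean of the per-label F1 scores (as in Table IV).
    macro = np.mean([label_f1(pred, gold, l) for l in punct_labels])
    return micro, macro
```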
C. Handling data imbalance
Since we find that all models achieve much higher precision than recall, we make one additional adjustment to our better performers on RT-03-04: we reduce the contribution of the "Non-punctuated" class to the objective function. With this adjustment, more borderline cases are classified as "Punctuated", which balances precision and recall.

Figure 2. Impact of the weighted loss on the RT-03-04 dataset using the LSTM-CNN+ model.

Figure 2 shows that as the weight of the "Non-punctuated" class increases, precision increases while recall decreases. The best result achieved is 76.68% F1 (P = 75.57%, R = 77.82%) with a ratio of 0.35:0.65 between the two classes.
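One way to realize this weighting is sketched below: each word's cross-entropy term is scaled by the weight of its true class. The weight vector is an assumption; the 0.35:0.65 setting mirrors the best ratio reported in Figure 2, with the smaller weight presumably assigned to the "Non-punctuated" class.

```python
import numpy as np

def weighted_word_loss(y_t, u_t, class_weights, eps=1e-12):
    # y_t: predicted (K + 1)-class distribution; u_t: one-hot ground truth.
    # class_weights: per-class weights, e.g. 0.35 for "Non-punctuated" and
    # 0.65 for the punctuated classes (assumed reading of the 0.35:0.65 ratio).
    w = class_weights[int(np.argmax(u_t))]            # weight of the true class
    return -w * float((u_t * np.log(y_t + eps)).sum())
```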
V. CONCLUSION
In this paper, we have presented a novel sentence unit detection model that combines two dominant deep learning networks. The proposed model takes advantage of the LSTM's ability to capture long-range dependencies on multiple time scales and the CNN's ability to learn local features in the context of neighboring words.
Experiments on two datasets showed improvements of the LSTM-CNN model for all punctuation types compared to the traditional LSTM model. The overall F1 scores were improved by 1.91% and 1.82% on the two datasets respectively, and the standard deviations over 20 runs were reduced significantly. The largest improvements were achieved when adding the extra loss in the training phase.
Several experiments were conducted to verify the rationality and effectiveness of the model's components and the proposed materials. The results also demonstrated the robustness of our model, which can automatically adapt to different types of data, from telephone speech to broadcast news, with different label schemata. In addition, the proposed model scales well to both small (40 hours) and large (1,340 hours) corpora.
The experiments also highlighted the limitation of our model with respect to the data imbalance problem. We aim to address this problem, as well as further extensions of our model, in future work. Future research includes the use of a richer set of prosodic input representations and training a new English word embedding model on a corpus of spoken-style text.
ACKNOWLEDGMENT
This research is supported by the National Research Foundation Singapore under its AI Singapore Programme [Award No.: AISG-100E-2018-006]. We also thank the anonymous reviewers for their comments and suggestions.
REFERENCES
[1] D. A. Jones, F. Wolf, E. Gibson, E. Williams, E. Fedorenko, D. A. Reynolds, and M. Zissman, "Measuring the readability of automatic speech-to-text transcripts," in Eighth European Conference on Speech Communication and Technology, 2003.
[2] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[3] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning representations by back-propagating errors," Nature, vol. 323, no. 6088, p. 533, 1986.
[4] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[5] Y. Kim, "Convolutional neural networks for sentence classification," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, 2014, pp. 1746–1751.
[6] Q. Qian, M. Huang, J. Lei, and X. Zhu, "Linguistically regularized LSTM for sentiment classification," in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, vol. 1, 2017, pp. 1679–1689.
[7] N. Ueffing, M. Bisani, and P. Vozila, "Improved models for automatic punctuation prediction for spoken and written text," in INTERSPEECH, 2013, pp. 3097–3101.
[8] X. Zhang and Y. LeCun, "Text understanding from scratch," arXiv preprint arXiv:1502.01710, 2015.
[9] A. Stolcke et al., "Automatic detection of sentence boundaries and disfluencies based on recognized words," in Fifth International Conference on Spoken Language Processing, 1998.
[10] X. Wang, H. T. Ng, and K. C. Sim, "Dynamic conditional random fields for joint sentence boundary and punctuation prediction," in INTERSPEECH, 2012.
[11] X. Che, C. Wang, H. Yang, and C. Meinel, "Punctuation prediction for unsegmented transcript based on word vector," in LREC, 2016.
[12] O. Tilk and T. Alumäe, "LSTM for punctuation restoration in speech transcripts," in INTERSPEECH, 2015.
[13] C. Xu, L. Xie, G. Huang, X. Xiao, E. S. Chng, and H. Li, "A deep neural network approach for sentence boundary detection in broadcast news," in INTERSPEECH, 2014.
[14] K. Xu, L. Xie, and K. Yao, "Investigating LSTM for punctuation prediction," in International Symposium on Chinese Spoken Language Processing. IEEE, 2016, pp. 1–5.
[15] O. Tilk and T. Alumäe, "Bidirectional recurrent neural network with attention mechanism for punctuation restoration," in INTERSPEECH, 2016, pp. 3047–3051.
[16] H. Q. Le, D. C. Can, S. T. Vu, T. H. Dang, M. T. Pilehvar, and N. Collier, "Large-scale exploration of neural relation classification architectures," in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018, pp. 2266–2277.
[17] N. Kalchbrenner, E. Grefenstette, and P. Blunsom, "A convolutional neural network for modelling sentences," in Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, vol. 1, 2014, pp. 655–665.
[18] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, "Enriching word vectors with subword information," Transactions of the Association for Computational Linguistics, vol. 5, pp. 135–146, 2017.
[19] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," CoRR, 2014. [Online]. Available: http://arxiv.org/abs/1412.6980
[20] P. Bell et al., "The MGB challenge: Evaluating multi-genre broadcast media recognition," in IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU). IEEE, 2015, pp. 687–693.