HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY
Supervisor: Associate Professor Le Thanh Huong
Supervisor’s signature
School: Information and Communication Technology
May 15, 2023
SOCIALIST REPUBLIC OF VIETNAM
Independence – Freedom – Happiness

CONFIRMATION OF MASTER'S THESIS REVISIONS

Thesis author: Phan Tuấn Anh
Thesis title: Improving the quality of inverse text normalization based on neural networks and numerical entity recognition
Major: Data Science and Artificial Intelligence (Elitech)
Student ID: 20211263M
The author, the scientific supervisor, and the Thesis Examination Committee confirm that the author has revised and supplemented the thesis according to the minutes of the Committee meeting on 22/4/2023, with the following contents:
1. Corrected spelling errors, proofread the wording, and revised the thesis layout
2. Removed descriptive content from the captions of figures and tables
10. Added remarks and an explanation of why the Vietnamese data set gives worse results than the English data set (in Section 4.4.1.2)
11. Revised the references (removed citations of arXiv papers)
Date:
CHAIR OF THE COMMITTEE
Graduation Thesis Assignment
Name: Phan Tuan Anh
Phone: +84355538467
Email: Anh.PT211263M@sis.hust.edu.vn; phantuananhkt2204k60@gmail.com
Class: 21A-IT-KHDL-E
Affiliation: Hanoi University of Science and Technology
I, Phan Tuan Anh, hereby declare that this thesis on the topic "Improving the quality of inverse text normalization based on neural network and numerical entities recognition" is my personal work, performed under the supervision of Associate Professor Le Thanh Huong. All data used for analysis in the thesis come from my own research, analyzed objectively and honestly, with a clear origin, and have not been published in any form. I take full responsibility for any dishonesty in the information used in this study.
Student
Signature and Name
I wish that these few lines of text could help me convey my most sincere gratitude to my supervisor, Associate Professor Le Thanh Huong, who has driven and encouraged me throughout the two years of my master's course. She has listened to my ideas and given me numerous valuable pieces of advice for my proposal. Besides, she also pointed out the downsides of my thesis, which was very helpful for me in perfecting it.

I would like to thank Dr. Bui Khac Hoai Nam and the other members of the NLP team at Viettel Cyberspace Center, who always support me and provide me with foundational knowledge. Especially, my leader, Mr. Nguyen Ngoc Dung, always creates favorable conditions for me to conduct extensive experiments in this study.

Last but not least, I would like to thank my family, who play the most important role in my life. They are constantly my motivation to accept and overcome the challenges I face at this time.
Neural inverse text normalization (ITN) has recently become an emerging approach for automatic speech recognition in terms of post-processing for readability. In particular, leveraging ITN by using neural network models has achieved remarkable results instead of relying on the accuracy of manual rules. However, ITN is a highly language-dependent task and is especially tricky in ambiguous languages. In this study, we focus on improving the performance of ITN tasks by adopting a combination of neural network models and rule-based systems. Specifically, we first use a seq2seq model to detect numerical segments (e.g., cardinals, ordinals, and dates) of input sentences. Then, the detected segments are converted into the written form using rule-based systems. Technically, a major difference in our method is that we only use neural network models to detect numerical segments, which makes it possible to deal with the low-resource and ambiguous scenarios of target languages. In addition, to further improve the quality of the proposed model, we also integrate a pre-trained language model, BERT, and one variant of BERT (RecogNum-BERT) as initialization points for the parameters of the encoder.

Regarding the experiments, we evaluate two different languages, English and Vietnamese, to indicate the advantages of the proposed method. Accordingly, empirical evaluations provide promising results for our method compared with state-of-the-art models in this research field, especially in the case of low-resource and complex data scenarios.

Student
Signature and Name
TABLE OF CONTENTS

CHAPTER 1 Introduction
1.1 Research background
1.2 Research motivation
1.3 Research objective
1.4 Related publication
1.5 Thesis organization
CHAPTER 2 Literature Review
2.1 Related works
2.1.1 Rule-based methods
2.1.2 Neural network model
2.1.3 Hybrid model
2.2 Background
2.2.1 Encoder-decoder model
2.2.2 Transformer
2.2.3 BERT
CHAPTER 3 Methodology
3.1 Baseline model
3.2 Proposed framework
3.3 Data creation process
3.4 Number recognizer
3.4.1 RNN-based and vanilla transformer-based
3.4.2 BERT-based
3.4.3 RecogNum-BERT-based
3.5 Number converter
CHAPTER 4 Experiment
4.1 Datasets
4.2 Hyper-parameter configurations
4.2.1 RNN-based and vanilla transformer-based configurations
4.2.2 BERT-based and RecogNum-BERT-based configurations
4.3 Evaluation metrics
4.3.1 Bilingual evaluation understudy (BLEU)
4.3.2 Word error rate (WER)
4.3.3 Number precision (NP)
4.4 Result and Analysis
4.4.1 Experiments without pre-trained LM
4.4.2 Experiments with pre-trained LM
4.5 Visualization
CHAPTER 5 Conclusion
5.1 Summary
5.2 Future work
LIST OF FIGURES

1.1 The role of the Inverse text normalization module in spoken dialogue systems
2.1 The pipeline of the NeMo toolkit for inverse text normalization
2.2 The overview of the encoder-decoder architecture for a machine translation example (English→Vietnamese)
2.3 The overview of using an LSTM-based encoder block (left) and the architecture of the LSTM (right)
2.4 The illustration of the decoding process
2.5 The general architecture of the vanilla Transformer, which is introduced in [20]
2.6 The description of scaled dot-product attention (left) and multi-head attention (right)
2.7 The overview of the pre-training procedure of BERT, which is trained on a large corpus with next sentence prediction and masked token prediction
3.1 The overview of my baseline model (the seq2seq model for the ITN problem) [7]
3.2 The general framework of the proposed method (hybrid model) for the Neural ITN approach
3.3 The overview of the data creation pipeline for training the Number recognizer
3.4 The training and inference processes of applying BERT as the initializing encoder for the proposed model
3.5 The overview of the training and inference processes of my proposed model when integrating RecogNum-BERT
3.6 Our architecture for creating RecogNum-BERT
3.7 The pipeline of data preparation for fine-tuning RecogNum-BERT
4.1 The comparison of models on the English test set with BLEU score (higher is better)
4.2 The comparison of models on the English test set with WER score (lower is better)
4.3 The comparison of models on the English test set with NP score (higher is better)
4.4 The comparison of models on the Vietnamese test set with BLEU score (higher is better)
4.5 The comparison of models on the Vietnamese test set with WER score (lower is better)
4.6 The comparison of models on the Vietnamese test set with NP score (higher is better)
LIST OF TABLES

1.1 Examples of the ambiguous semantic problem in the Vietnamese language
3.1 An example output of the Number converter
4.1 The training size, validation size, vocabulary size, and average input sequence length of my datasets
4.2 The results of the Number recognizer on the validation set with BLEU score
4.3 Comparison between variants of the proposed method with different encoders for the Number recognizer module (Transformer base, BERT, RecogNum-BERT) in BLEU score
4.4 Comparison between variants of the proposed method with different encoders for the Number recognizer (vanilla Transformer, BERT, RecogNum-BERT) in WER score
4.5 Examples of prediction errors of the baseline model in English
4.6 Examples of prediction errors of the proposed model in Vietnamese
Notation Description
ASR Automatic Speech Recognition
BERT Bidirectional Encoder Representations from Transformers
Bi-LSTM Bidirectional Long Short-Term Memory
BLEU Bilingual Evaluation Understudy
CNN Convolutional neural network
end2end End to End
FST Finite state transducer
ITN Inverse text normalization
OOV out of vocabulary
POS Part of speech
seq2seq Sequence to Sequence
TN Text normalization
TTS Text To Speech
WER Word error rate
WFST Weighted finite state transducer
CHAPTER 1 Introduction

1.1 Research background

The text in spoken form is lowercase and does not contain any punctuation or numerical tokens. Consequently, ITN processes it to yield the text in written form, in which the numerical tokens are converted into a natural format.

Figure 1.1: The role of the Inverse text normalization module in spoken dialogue systems
Additionally, text normalization (TN) is the inverse problem of ITN, which transforms text in written form into spoken form. Despite being two opposite processes, ITN and TN have a close relationship, and many researchers use similar approaches and techniques for dealing with them. Nevertheless, unlike the promising methods explored for the TN problem in recent years [1], there have not been many remarkable achievements for the ITN problem, which is regarded as one of the most challenging NLP tasks.
The conventional approach to ITN is rule-based systems. For instance, Finite State Transducer (FST)-based models [2] have shown competitive results [3]. However, the major issue with this approach is scalability, since it requires complex and accurate transformation rules [4]. Recently, neural ITN has become an emerging topic in this research field, exploiting the power of neural networks (NN) for ITN tasks.

Furthermore, due to the significant difference between written and spoken forms, handling numbers with minimal error is a central problem in this research field.
In particular, mirroring the way humans handle numeric values, the models should work well on two consecutive tasks: recognizing the parts that belong to numeric values, and combining those parts into precise numbers.
Specifically, NN-based models, typically seq2seq, have achieved high performance and become state-of-the-art models for the ITN problem [5], [6], [7]. Nevertheless, as mentioned above, ITN is a highly language-dependent task and requires linguistic knowledge. In this regard, the data-hungry problem (i.e., low-resource scenarios) is an open issue that needs to be taken into account to improve performance. For instance, in a data-shortage situation, models might lack information during the training stage to recognize and transform numerical segments, which leads to poor ITN performance.
In this study, I investigate how to improve the performance of neural ITN in low-resource and ambiguous scenarios. Particularly, for number-formatting problems, conventional seq2seq models might fail to generate digits sequentially, character by character, which is often required for long numbers (e.g., phone numbers and big cardinals). For example, the number 'one billion and eight' must be converted into 10 sequential characters: '1 0 0 0 0 0 0 0 0 8'. Moreover, the scarcity of data in the training process can make this issue even worse when the considered languages have many ambiguous semantics between numbers and words. For instance, in Vietnamese, the word 'không' (English translation: no) can be the digit '0' but is also used to indicate a negative opinion. Tab. 1.1 illustrates several examples of the ambiguous semantic problem in the Vietnamese language.
In this thesis, the proposed framework includes two stages: i) in the first stage, I use a neural network model to detect the numerical segments in each sentence; ii) in the second stage, the output of the first stage is converted into the written form using a set of rules.
Table 1.1: Examples of the ambiguous semantic problem in the Vietnamese language

Spoken form (English translation)                                   | Number | Word
tôi không thích cái bánh này (I do not like this cake)              |        |  ✓
không là số tự nhiên nhỏ nhất (zero is the smallest natural number) |   ✓    |
năm một nghìn chín trăm chín bảy (nineteen ninety-seven)            |        |  ✓
năm mươi nghìn (fifty thousand)                                     |   ✓    |
chín quả táo (nine apples)                                          |   ✓    |
Accordingly, the main difference between my method and previous works is that I only use the neural network to detect numerical segments in each sentence in the first stage. The number reading is processed in the second stage by a set of rules, which is able to supply substantial information to the system without requiring much data to learn, as end-to-end models do. Additionally, in the first stage of my pipeline, I implement the numerical detector as a seq2seq model, in which I also investigate the efficiency of several conventional approaches: RNN-based and Transformer-based. Besides, I also take advantage of a pre-trained language model (BERT and a variant of BERT, RecogNum-BERT) to boost the performance.
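To make the two-stage idea concrete, the following is a minimal, self-contained sketch. The function names (tag_numeric_segments, convert_segments) and the tiny digit dictionary are hypothetical stand-ins: in the actual framework of Chapter 3, the first stage is a seq2seq Number recognizer and the second stage is a much richer rule-based Number converter.

```python
# Minimal sketch of the two-stage hybrid ITN pipeline (illustrative only).
SPOKEN_DIGITS = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}

def tag_numeric_segments(tokens):
    """Stage 1 (normally a seq2seq model): mark tokens belonging to a number."""
    return [tok in SPOKEN_DIGITS for tok in tokens]

def convert_segments(tokens, tags):
    """Stage 2 (rule-based converter): rewrite tagged spans, keep the rest."""
    out, i = [], 0
    while i < len(tokens):
        if tags[i]:
            digits = []
            while i < len(tokens) and tags[i]:
                digits.append(SPOKEN_DIGITS[tokens[i]])
                i += 1
            out.append("".join(digits))          # "nine seven one two" -> "9712"
        else:
            out.append(tokens[i])
            i += 1
    return " ".join(out)

tokens = "call me at nine seven one two".split()
print(convert_segments(tokens, tag_numeric_segments(tokens)))  # call me at 9712
```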
Generally, the main contributions of my method are as follows:
• I propose a novel hybrid approach that combines a neural network with a rule-based system, which is able to deal with ITN problems in low-resource and ambiguous scenarios. This is the first research to conduct experiments on and analyze this data scenario.

• I evaluate the proposed methods on two different languages, English and Vietnamese, with promising results. Specifically, my method can be extended easily to other languages without requiring any linguistic or grammatical knowledge.

• I propose a novel approach for integrating the knowledge of a pre-trained language model (BERT) to enhance the quality of my method.

• I present a novel pipeline to build an ITN model for Vietnamese that is based on a neural network.
1.2 Research motivation
Regarding the research motivation, I have two remarkable points as follows:
• The data scenario I would like to consider throughout this thesis is low-resource and ambiguous data. As I describe in detail later in the Experiment chapter, for research purposes, most previous researchers reuse the data of the TN problem [2]. Essentially, because ITN and TN are opposite problems, the authors reverse the order of each sample in the TN data set to turn it into a sample for ITN. Intuitively, I can suppose that there is no standard data set for ITN tasks. Vietnamese faces the same issue, as there is no annotated data set for ITN. Besides, building a data set for the ITN problem in any language costs considerably in workforce and time. This phenomenon may limit the data sets available in both industry and academia, and it is the biggest motivation for me to consider the low-resource data scenario. Additionally, the complexity of the data also needs to be considered. For several languages, including Vietnamese, ambiguity becomes a serious problem (as in the examples in Table 1.1). This increases the difficulty level of the data that a model has to deal with. More specifically, both conventional methods, rule-based systems and neural networks, might perform poorly when dealing with such a complex language. As a consequence, besides the low-resource scenario, I am also motivated to consider a second scenario: ambiguous data.
• As mentioned above, both approaches, a rule-based system and a neural network system, might face obstacles with low-resource and complex data, and each has its own particular downsides. Building a rule-based system is extremely complicated and requires a great deal of expert language knowledge. In addition, a rule-based system is limited in its ability to generalize, upgrade, and extend. For an ambiguous language such as Vietnamese, this problem is more serious, and in some cases the rule-based system fails. Regarding neural networks, despite having the capacity to deal with the generalization and ambiguity problems by learning contextual embeddings, this approach requires plentiful resources. Moreover, because there is no mechanism for effectively controlling the output, using an end2end neural network model also causes unrecoverable errors. The above reasons encourage us to invent a new method that combines the two approaches, harmonizing their strengths as well as eliminating their drawbacks.
1.3 Research objective

The entities that I focus on are numerical entities. Besides, as mentioned in the research motivation section, I put my method under the scenario of low-resource and ambiguous data, and my method is a combination of the two conventional approaches: rule-based and neural network systems.

Using the hybrid model, I want to prove the efficiency of my method on limited and complex data. Not only that, I also compare the performance of my model with the baseline model in the resource-rich setting to see the downside of my method when the data is rich.

Due to the effectiveness of transferring knowledge from a pre-trained language model to downstream NLP tasks, I also would like to test whether this statement holds for the ITN problem. Additionally, by inventing a novel variant of BERT, I would like to test whether supplementing additional information about the appearance of numerical entities in sentences can boost the performance or not.

Finally, via this thesis, I desire to raise the awareness of researchers about this problem as well as my scenarios. I hope the method presented in my thesis is helpful for real production.
1.4 Related publication
Phan, T. A., Nguyen, N. D., Thanh, H. L., Bui, K. H. N. (2022, December). Neural Inverse Text Normalization with Numerical Recognition for Low Resource Scenarios. In Intelligent Information and Database Systems: 14th Asian Conference, ACIIDS 2022, Ho Chi Minh City, Vietnam, November 28–30, 2022, Proceedings, Part I (pp. 582–594). Cham: Springer International Publishing. (Accepted)
1.5 Thesis organization

Chapter 2 describes the literature review. This chapter is further divided into two subparts: related work and background. In the related work, I summarize the existing work dealing with the ITN and TN problems, categorized into three main approaches. The background part provides the fundamental knowledge about the backbone of my methodology, such as the encoder-decoder architecture, the LSTM-based and Transformer-based seq2seq models, and an overview of BERT.

Chapter 3 reveals the details of my methodology. In this chapter, I introduce the overview of the baseline model, the proposed model, the pipeline for creating data for seq2seq training, and the LSTM-based, Transformer-based, BERT, and BERT-variant encoders applied to the encoder-decoder model. At the end of this chapter, I describe the way I build the rule-based system for both English and Vietnamese, which plays a crucial role in the whole system.

Chapter 4 first shows the important traits of my data and the details of the configurations that I use in my experiments in this thesis. Besides, this chapter also provides a thorough analysis and comparison between my method and the baseline.

Chapter 5 summarizes my contributions in this thesis. In this chapter, I also give some interesting ideas that I am going to investigate in the future.
CHAPTER 2 Literature Review

2.1 Related works

2.1.1 Rule-based methods

Kestrel is a component of the Google TTS synthesis system [2] that concentrates on solving the TN problem. Kestrel first categorizes the numerical entities in text into multiple semiotic classes: measure, percentage, currency amount, date, cardinal, ordinal, decimal, fraction, time, telephone number, and electronic address. Subsequently, they use a protocol buffer whose basic unit is the message. One message is essentially a dictionary, which consists of named keys and certain values. The values might include integers, strings, booleans, or even other nested messages. The whole process can be divided into two stages: classification and verbalization. While the classifier is responsible for recognizing which semiotic class the corresponding token belongs to via WFST grammars, the verbalizer receives the message and converts it into the right form. In terms of evaluation results, Kestrel achieves promising accuracy in both English and Russian. Especially, for both languages, their proposal reaches virtually absolute accuracy for several semiotic classes (cardinal, date, decimal, and electronic) and 99% on the Google TN test data set [4].

For ITN, the set of rules in [5] achieves 99% on internal data from a virtual assistant application. Recently, in order to pave a new path for seamless production development, Zhang et al. introduce an open-source Python WFST-based library [3]. The system is illustrated in Figure 2.1. Compared to Kestrel's pipeline, this flow is somewhat similar in the way the authors define the semiotic classes, except that the process is reversed. NeMo ITN also includes a two-stage normalization pipeline that first detects semiotic tokens (classification) and then converts these to written form (verbalization). Both stages consume a single WFST grammar. The major problem is that this approach requires significant effort and time to scale the system across languages. Nevertheless, by using the Python library Pynini to formulate and compile grammars, NeMo ITN can easily add new rules, modify an existing class, or add an entirely new semiotic class. This is a huge advantage for deploying it into a production environment. In terms of results, the NeMo toolkit obtains an exact match of 98.5% for CARDINAL and 78.65% for DECIMAL on the cleaned data set.

Figure 2.1: The pipeline of the NeMo toolkit for inverse text normalization.

Overall, when using grammar rules such as WFSTs, we do not need an annotated data set, but this method costs significant language-expert effort and makes the model challenging to scale.
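To illustrate the classify-then-verbalize design shared by Kestrel and NeMo ITN, here is a toy, self-contained Python sketch. It is not Kestrel's protocol buffers nor NeMo's Pynini grammars; the message dictionary, the two semiotic classes, and the naive word-summing converter are simplifying assumptions for illustration only.

```python
# Toy two-stage "classification + verbalization" pipeline for a single phrase.
SPOKEN_CARDINALS = {"one": 1, "two": 2, "three": 3, "five": 5, "twenty": 20}

def classify(phrase):
    """Stage 1: assign a semiotic class and structured fields (a 'message')."""
    tokens = phrase.split()
    if tokens and tokens[-1] == "dollars":
        return {"class": "MONEY", "amount": tokens[:-1], "currency": "$"}
    return {"class": "CARDINAL", "value": tokens}

def verbalize(message):
    """Stage 2: render the message into written form, per semiotic class."""
    def to_number(words):
        # Naive: only handles sums like "twenty five"; real grammars are richer.
        return sum(SPOKEN_CARDINALS.get(w, 0) for w in words)
    if message["class"] == "MONEY":
        return f'{message["currency"]}{to_number(message["amount"])}'
    return str(to_number(message["value"]))

print(verbalize(classify("twenty five dollars")))   # -> $25
print(verbalize(classify("twenty three")))          # -> 23
```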
2.1.2 Neural network model
Recurrent Neural Network (RNN)-based seq2seq models [8] have been adopted to reduce manual processes.
For the TN problem, Sproat et al. [9] consider this problem as a machine translation task and develop an RNN-based seq2seq model trained on window-based data. Specifically, an input sentence is regarded as a sequence of characters, and the output sentence is a sequence of words. Furthermore, because of the input sequence length problem, they split a sentence into chunks with a window size equal to three to create training samples in which normalized tokens are marked by a distinctive begin tag <norm> and end tag </norm>. In this regard, this approach is able to limit the number of input and output nodes to something reasonable. Their neural network architecture closely follows that of [10].
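As an illustration of this window-based sample construction, the sketch below wraps each token to be normalized in <norm> ... </norm> tags with three context tokens on each side. It works on word tokens for readability, whereas [9] operates on characters on the input side; the token list and normalization flags are made up for the example.

```python
# Build window-based training samples with <norm> tags (window size three).
def make_windows(tokens, to_normalize, window=3):
    samples = []
    for i, tok in enumerate(tokens):
        if not to_normalize[i]:
            continue
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        samples.append(" ".join(left + ["<norm>", tok, "</norm>"] + right))
    return samples

tokens = ["the", "meeting", "is", "on", "2023-05-15", "at", "10am", "sharp"]
to_normalize = [False, False, False, False, True, False, True, False]
for sample in make_windows(tokens, to_normalize):
    print(sample)
# meeting is on <norm> 2023-05-15 </norm> at 10am sharp
# on 2023-05-15 at <norm> 10am </norm> sharp
```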
Subsequently, Sevinj et al. [11] proposed a novel end-to-end Convolutional Neural Network (CNN) architecture with residual connections for the TN task. Particularly, they consider the TN problem as a classification problem with two stages: i) first, the input sentence is segmented into chunks, similar to [9], and a CNN-based model labels each chunk with its corresponding class based on the scores of a softmax function; ii) after the classification stage, they apply rule-based methods depending on each class. Rather than using grammar rules, leveraging the CNN model proves the efficiency of this method, with approximately 99.44% accuracy over all semiotic classes.
Mansfield et al. [12] first take advantage of an RNN-based seq2seq model for dealing with the TN problem. Especially, both input and output are processed using subword units to overcome the OOV issue; numerical tokens are even tokenized to the character level. In addition, linguistic features such as 1) capitalization (upper, lower, mixed, non-alphanumerical, foreign characters), 2) position (beginning, middle, end, singleton), 3) POS tags, and 4) labels are also taken into account for enhancing the quality of the neural machine translation model. To integrate linguistic features with subword units, the concatenate or add operator is utilized, and the combined embedding is fed into the Bi-LSTM encoder. When compared with the window-size-based method as a baseline, the full model achieves an SER (sentence error rate) of only 0.78%, a WER (word error rate) of 0.17%, and a BLEU score of approximately 99.73%.
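The following is a rough sketch of that feature-combination idea: subword embeddings are concatenated with embeddings of linguistic features and fed into a Bi-LSTM encoder. It is not the exact model of [12]; the feature set, vocabulary sizes, and dimensions are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

class FeatureAwareEncoder(nn.Module):
    """Subword embedding + linguistic-feature embeddings -> Bi-LSTM encoder."""
    def __init__(self, vocab_size=8000, n_case=5, n_pos=20,
                 d_tok=128, d_feat=16, d_hid=256):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_tok)
        self.case_emb = nn.Embedding(n_case, d_feat)   # capitalization feature
        self.pos_emb = nn.Embedding(n_pos, d_feat)     # POS-tag feature
        self.encoder = nn.LSTM(d_tok + 2 * d_feat, d_hid,
                               batch_first=True, bidirectional=True)

    def forward(self, tok_ids, case_ids, pos_ids):
        # Concatenate token and feature embeddings along the last dimension.
        x = torch.cat([self.tok_emb(tok_ids),
                       self.case_emb(case_ids),
                       self.pos_emb(pos_ids)], dim=-1)
        hidden_states, _ = self.encoder(x)             # (batch, seq, 2 * d_hid)
        return hidden_states

enc = FeatureAwareEncoder()
dummy = torch.zeros(2, 7, dtype=torch.long)
print(enc(dummy, dummy, dummy).shape)                  # torch.Size([2, 7, 512])
```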
For the ITN problem, Sunkara et al. [7] also consider ITN in the form of a machine translation task. Inspired by the subword tokenizer methods in [12], they first tokenize the sentence using the SentencePiece toolkit [13], then feed the embedding features to the encoder. The output of the decoder is recovered through several post-processing steps. Their proposed architecture uses both RNN-based and Transformer-based encoder-decoder models with a copy attention mechanism in decoding. The result comparisons show the best performance for the Transformer-based method in various domains of the test set, with only 1.1% WER for Wikipedia and 2.0%, 2.4%, and 1.5% for CNN, Daily Mail, and News-C, respectively. Additionally, motivated by the impressive outcome of using neural networks for ITN, the authors also consider using pre-trained language model information to boost the performance. They find that using pre-trained models such as BART [14] to initialize the encoder and decoder does not yield good performance, while using BERT [15] to extract contextual embeddings and fuse them into each layer of the Transformer as in [16] benefits minimally. Beyond English, their model also obtains good WER scores in German, Spanish, and Italian with 2.1%, 5%, and 1.6%, respectively.
In conclusion, methods that use a neural network model to deal with both TN and ITN tasks can relieve the model of the serious cost of language-expert information as well as complex model structures. These methods can easily be expanded to multiple languages and scaled to huge systems. Nevertheless, implementing these methods also faces several challenges:
• Training neural networks requires a large amount of labeled data. Practically, in most of the research regarding the ITN problem, to yield sample pairs for training, the authors use the Google TN data set and swap input and output to create data for ITN. This phenomenon raises concerns about the quality of the model under the low-resource scenario.
• Because seq2seq is used as an end-to-end model, the neural networks can generate unrecoverable errors. When handling semiotic classes such as numerical entities, these errors can cause severe problems in practical applications.
2.1.3 Hybrid model
Both WFSTs and neural networks for ITN have downsides of their own. WFST-based methods strongly depend on the volume and accuracy of the set of grammar rules that language experts can provide, and in some cases they are not able to cover all situations. Meanwhile, neural networks consume a great amount of annotated data and suffer from unrecoverable errors. Therefore, many existing approaches combine the two aforementioned methods to overcome their weaknesses.
Pusateri et al. [5] present a data-driven approach for ITN problems using a set of simple rules and a few hand-crafted grammars to cast ITN as a labeling problem. Then, a bidirectional LSTM model is adopted to solve the classification problem.
Sunkara et al. [7] propose a hybrid approach combining Transformer-based seq2seq models and FST-based text normalization techniques, where the output of the neural ITN model is passed through an FST. They use a confidence score emitted by the neural model to decide whether the system should use the neural ITN output. Intuitively, the confidence score can be considered as a filter that switches the overall model to choose the output of the FST rather than the neural model in case the model encounters an unrecoverable error. Basically, this approach is not truly a combination of the two approaches, because the final output is produced by one of the two models.
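The confidence-based switching just described boils down to a simple gate, sketched below. The threshold value and the two stub "models" are hypothetical; in [7] the confidence comes from the neural seq2seq model itself.

```python
# Toy confidence gate: keep the neural ITN output only when confident enough.
def hybrid_itn(sentence, neural_itn, fst_itn, threshold=0.9):
    written, confidence = neural_itn(sentence)
    return written if confidence >= threshold else fst_itn(sentence)

neural_stub = lambda s: ("call me at 9712", 0.42)   # (output, confidence)
fst_stub = lambda s: "call me at 9712"
print(hybrid_itn("call me at nine seven one two", neural_stub, fst_stub))
```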
2.2 Background
In this section, I briefly introduce several crucial pieces of knowledge related to this work. They include the encoder-decoder model, the Transformer architecture, and the pre-trained language model BERT. All of them are significant units that constitute my proposed model.
2.2.1 Encoder-decoder model
The encoder-decoder model was first presented in [8] by Sutskever et al., aiming to solve the seq2seq problem. The seq2seq problem generally refers to an NLP task in which both input and output are sequences of tokens, such as machine translation, abstractive summarization, text generation, and text normalization.

Intuitively, an encoder-decoder model is made of two main components: one encoder and one decoder. Each component is further constituted of smaller units, called blocks: encoder blocks and decoder blocks. Each encoder block receives the output of the previous block as input, tries to capture information to create the hidden state, and forwards it to the following block. The hidden state can be understood as an encoding of the information in vector space. Consecutively, the encoded information of the encoder blocks is passed to the decoder blocks to decode the list of tokens sequentially. In each decoding step, the decoder block must use information from both the previous block and the encoder block to predict the next token until a particular condition is met. The final result is the combination of all predicted tokens. Figure 2.2 shows the overview of the encoder-decoder architecture for a machine translation problem (English → Vietnamese), in which the source sentence in English is passed through the encoder, which is constructed from multiple stacked encoder blocks. The output of a lower block is used as the input of the next one. Finally, the encoder tries to obtain the context vector by capturing the intra-relations between the elements of the source sentence. Sequentially, the context vector is fed into the decoder, which creates the final result as the target sentence in Vietnamese.
In the next part of this section, I review in detail the application of the Long Short-Term Memory (LSTM) network to the encoder-decoder model, as presented in [8].
Figure 2.2: The overview of the encoder-decoder architecture for a machine translation example (English→Vietnamese).
Encoder: The LSTM, introduced in [17], is a recurrent neural network architecture. By introducing three new types of gates (the input gate regulates the amount of incoming information, the forget gate decides how much information will be discarded at the current step, and the output gate controls which part of the current state is output), the LSTM is able to deal with the vanishing gradient problem of the conventional RNN model and effectively capture information from time-series data. With the LSTM as the encoder block, I denote $s = (s_i)_{i=1}^{N}$ as the input string, where $s_i$ indicates the $i$-th token. In the $l$-th encoder block, $x_i$ is the embedding of the $i$-th token, i.e., a vector from a learnable embedding matrix, and $h_i$ is the hidden state of the model at time step $i$. The value of $h_i$ can be computed based on $h_{i-1}$ and $x_i$ as follows:
$\mathrm{forget}_i = \sigma_g(W_{\mathrm{forget}} x_i + U_{\mathrm{forget}} h_{i-1} + b_{\mathrm{forget}})$  (2.1)

$\mathrm{input}_i = \sigma_g(W_{\mathrm{input}} x_i + U_{\mathrm{input}} h_{i-1} + b_{\mathrm{input}})$  (2.2)

$\mathrm{output}_i = \sigma_g(W_{\mathrm{output}} x_i + U_{\mathrm{output}} h_{i-1} + b_{\mathrm{output}})$  (2.3)

$\tilde{c}_i = \sigma_h(W_c x_i + U_c h_{i-1} + b_c)$  (2.4)

$c_i = \mathrm{forget}_i \odot c_{i-1} + \mathrm{input}_i \odot \tilde{c}_i$  (2.5)

$h_i = \mathrm{output}_i \odot \sigma_h(c_i)$  (2.6)

where $\mathrm{input}_i$, $\mathrm{output}_i$, and $\mathrm{forget}_i$ determine how much information will be used and diminished, respectively; $\sigma_g$ denotes the sigmoid function and $\sigma_h$ indicates the tanh function. Now, I have $h_i$ as the output of the $i$-th token at that time step. For the following layer $(l+1)$, $h_i^{l}$ becomes the new input embedding of the $i$-th token, $x_i^{l+1} = h_i^{l}$, and a similar process is repeated until reaching the last block. Finally, with an encoder containing $k$ blocks, the final hidden states are $h^{k} = (h_1^{k}, h_2^{k}, \ldots, h_N^{k})$. These vectors are also called context vectors.
The whole encoding process aims to extract valuable information based on the characteristics of each token $x_i$ and the knowledge about the structure of the input: its sequentiality. Practically, instead of only using a one-directional LSTM as above, researchers usually take advantage of both directions of the sequence by using a bidirectional LSTM for the encoder. Essentially, the bidirectional LSTM (Bi-LSTM) has the same architecture as the LSTM, except that information is combined from the forward direction (from left to right) and the backward direction (from right to left). The combination is performed by a concatenation operation:

$h_i^{\mathrm{Bi\text{-}LSTM}} = \big[\,\overrightarrow{h_i} \,;\, \overleftarrow{h_i}\,\big]$

The overview of the encoder with multiple blocks and the integration of the LSTM model for building the encoder is given in Figure 2.3.

Figure 2.3: The overview of using an LSTM-based encoder block (left) and the architecture of the LSTM (right).
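As a worked illustration of Eqs. (2.1)–(2.6), the snippet below performs a single LSTM step with numpy. The shapes and random parameters are arbitrary; this is a didactic sketch rather than the training code used in this thesis.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_i, h_prev, c_prev, params):
    """One LSTM time step following Eqs. (2.1)-(2.6)."""
    W, U, b = params["W"], params["U"], params["b"]      # dicts keyed by gate
    f = sigmoid(W["forget"] @ x_i + U["forget"] @ h_prev + b["forget"])   # (2.1)
    i = sigmoid(W["input"] @ x_i + U["input"] @ h_prev + b["input"])      # (2.2)
    o = sigmoid(W["output"] @ x_i + U["output"] @ h_prev + b["output"])   # (2.3)
    c_tilde = np.tanh(W["cell"] @ x_i + U["cell"] @ h_prev + b["cell"])   # (2.4)
    c_i = f * c_prev + i * c_tilde                                        # (2.5)
    h_i = o * np.tanh(c_i)                                                # (2.6)
    return h_i, c_i

d, dh = 4, 3
rng = np.random.default_rng(0)
gates = ("forget", "input", "output", "cell")
params = {
    "W": {g: rng.normal(size=(dh, d)) for g in gates},
    "U": {g: rng.normal(size=(dh, dh)) for g in gates},
    "b": {g: np.zeros(dh) for g in gates},
}
h, c = lstm_step(rng.normal(size=d), np.zeros(dh), np.zeros(dh), params)
print(h.shape, c.shape)   # (3,) (3,)
```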
Decoder: Opposite to the functionality of the encoder, the decoder aims to transform the context vectors from the vector space into the output. Likewise to the architecture of the encoder block, the decoder block is also easily implemented using an LSTM. The Bi-LSTM is not considered here because the decoding process has only one direction: from left to right. In the work in [8], the decoder only takes the last context vector $h_N^{k}$ as the initialization of the hidden state of the first step, rather than the zero vector as in the encoder. With a decoder containing $K$ blocks, the decoding process is performed sequentially as follows:
• At time step $j$ (to avoid confusion, I denote $h_N^{k} = h_{\mathrm{encode}}$), the hidden state of the $k$-th layer is computed from $h_{j-1}$ and the input from layer $(k-1)$, $x_j^{k-1}$, following the LSTM equations above.
The decoding continues until the predicted token is the special token END or the sequence length limit is reached. The overview of using the LSTM for the decoder is illustrated in Figure 2.4.
Figure 2.4: The illustration of the decoding process
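The step-by-step decoding described above amounts to a simple loop; here is a schematic greedy version. The decoder_step callable is a stand-in for the real decoder block(s), and the toy schedule exists only to make the example runnable.

```python
def greedy_decode(context_vector, decoder_step,
                  start_token="<BOS>", end_token="<END>", max_len=50):
    """Feed predictions back step by step until END or the length limit."""
    tokens = [start_token]
    hidden = context_vector              # h_N^k initializes the decoder state
    while len(tokens) < max_len:
        next_token, hidden = decoder_step(tokens[-1], hidden)
        if next_token == end_token:
            break
        tokens.append(next_token)
    return tokens[1:]                    # drop the start token

def toy_decoder_step(prev_token, hidden):
    # Stand-in for a real LSTM/Transformer decoder block.
    schedule = {"<BOS>": "1", "1": "9", "9": "<END>"}
    return schedule[prev_token], hidden

print(greedy_decode(None, toy_decoder_step))   # ['1', '9']
```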
To enhance the quality of the decoding process, [18] and [19] introduce different ways of applying attention mechanisms to improve the alignment ability of the model. Thanks to this mechanism, the decoder is able to decide which tokens on the encoder side contribute more to predicting the next token.

2.2.2 Transformer
Recently, the Transformer, a new seq2seq architecture, has achieved high performance on most NLP tasks [20]. Accordingly, the work in [7] has shown the advantage of the Transformer compared to RNN-based models in the ITN problem. Fig. 2.5 depicts the general architecture of the Transformer. Particularly, the Transformer reconstructs the encoder-decoder architecture with stacked self-attention and point-wise fully connected layers in both the encoder and the decoder.

Figure 2.5: The general architecture of the vanilla Transformer, which is introduced in [20].

Encoder: The encoder consists of 6 stacked encoder layers. The basic unit of each layer is a sub-layer or sub-block. The first sub-block is multi-head self-attention; the other is a 2-layer feed-forward network. Each sub-layer also employs a residual connection and a layer normalization. The dimension of the embedding vector is set as $d_{\mathrm{model}} = 512$.

Decoder: The decoder is also composed of 6 stacked decoder layers. In each decoder layer, the authors add an intermediate sub-layer that allows the decoder to perform attention over the output of the encoder. Likewise to the encoder, the first and last sub-layers are still multi-head self-attention and a point-wise feed-forward layer. A masking mechanism is used to prevent wrong attention to subsequent positions on the decoder side and to the positions of padding tokens on the encoder side.
The biggest difference, and also the strength, of the Transformer compared to the RNN-based encoder-decoder is multi-head attention. Particularly, by introducing three types of matrices (Query, Key, and Value) and scaled dot-product attention, this layer outputs the new hidden state of each token as the weighted sum of all considered tokens. In the case of self-attention, a token is able to decide the level of relevance between itself and the other tokens around it. In the other case, using the cross-attention mechanism, a token in the decoding process decides which tokens on the encoder side are most relevant to it. The illustrations of scaled dot-product attention and multi-head attention are given in Figure 2.6.
Figure 2.6: The description of scaled dot-product attention (left) and multi-head attention (right).

The detailed formulation of the scaled dot-product attention layer at head $i$ can be written as:

$x'_i = \mathrm{Attention}(x_i, W_i^Q, W_i^K, W_i^V) = \mathrm{softmax}\left(\dfrac{QK^{T}}{\sqrt{d_k}}\right)V$

where $x_i$ and $x'_i$ are the respective input and output of the attention layer; $W_i^Q \in \mathbb{R}^{d \times d_k}$, $W_i^K \in \mathbb{R}^{d \times d_k}$, and $W_i^V \in \mathbb{R}^{d \times d_k}$ are learnable parameters; $Q = xW_i^Q$, $K = xW_i^K$, and $V = xW_i^V$ are the query matrix, key matrix, and value matrix of head $i$, respectively; and $d$ and $d_k$ denote the hidden size of the model and of head $i$. The scale factor $\frac{1}{\sqrt{d_k}}$ is introduced to prevent the model from suffering too large or too small gradients.

Using multi-head attention benefits the model by jointly attending to multiple types of information at different positions:

$x' = \mathrm{MultiHead}(x, W^Q, W^K, W^V) = \mathrm{Concat}(\mathrm{head}_1, \mathrm{head}_2, \ldots, \mathrm{head}_k)\,W^O$  (2.14)

where $\mathrm{head}_i = \mathrm{Attention}(x_i, W_i^Q, W_i^K, W_i^V)$  (2.15)
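The following numpy sketch mirrors the scaled dot-product attention formula and Eqs. (2.14)–(2.15): one head computes softmax(QK^T / sqrt(d_k)) V, and several heads are concatenated and projected. Dimensions and random weights are arbitrary illustration choices, not values used in this thesis.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(x, W_q, W_k, W_v):
    """Scaled dot-product attention for a single head."""
    Q, K, V = x @ W_q, x @ W_k, x @ W_v            # each (seq_len, d_k)
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))      # (seq_len, seq_len)
    return weights @ V                             # weighted sum of values

def multi_head(x, heads, W_o):
    """Concatenate the heads' outputs and project back to the model size."""
    return np.concatenate([attention(x, *h) for h in heads], axis=-1) @ W_o

rng = np.random.default_rng(0)
seq_len, d, d_k, n_heads = 5, 16, 4, 4
x = rng.normal(size=(seq_len, d))
heads = [tuple(rng.normal(size=(d, d_k)) for _ in range(3)) for _ in range(n_heads)]
W_o = rng.normal(size=(n_heads * d_k, d))
print(multi_head(x, heads, W_o).shape)             # (5, 16)
```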
Sequentially, the log conditional probability of the output sequence can be interpreted as:

$\log p(y \mid x) = \sum_{t=1}^{T} \log p(y_t \mid y_{<t}, x)$

2.2.3 BERT
BERT stands for Bidirectional Encoder Representations from Transformers. BERT is designed with the architecture of the Transformer encoder, is pre-trained on unannotated data, and has the ability to combine both left and right contexts in all layers. As with other pre-trained language models, BERT was created to be applied to downstream NLP tasks. BERT learns the contextual embedding of each token based on two mechanisms: masked token prediction and next sentence prediction. The overview of the training process for BERT is given in Figure 2.7. To fulfill the two tasks simultaneously, BERT is required to investigate the contextual embeddings of all tokens of the input sentence. Due to being trained on a large corpus, BERT is able to collect huge knowledge of a particular language and is very useful for enhancing the performance of downstream NLP tasks.
Figure 2.7: The overview of the pre-training procedure of BERT, which is trained on a large corpus with next sentence prediction and masked token prediction.
Here, I only focus on analyzing masked token prediction. In the training process, a certain proportion of tokens in the original input are masked. In the final layer, the hidden states of the tokens at the corresponding positions are passed