HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY
Supervisor: Associate Professor Le Thanh Huong
Supervisor’s signature
School: Information and Communication Technology
May 15, 2023
SOCIALIST REPUBLIC OF VIETNAM
Independence – Freedom – Happiness

CONFIRMATION OF MASTER'S THESIS REVISIONS

Thesis author: Phan Tuấn Anh
Thesis title: Improving the quality of inverse text normalization based on neural networks and numerical entity recognition
Major: Data Science and Artificial Intelligence (Elitech)
Student ID: 20211263M
The author, the scientific supervisor, and the Thesis Examination Committee confirm that the author has revised and supplemented the thesis according to the minutes of the Committee meeting on 22/4/2023, with the following contents:
1. Corrected spelling errors, proofread the wording, and revised the thesis layout
2. Removed descriptive content from the captions of figures and tables
10. Added remarks and an explanation of why the Vietnamese data set gives worse results than the English data set (in Section 4.4.1.2)
11. Revised the references (removed citations of arXiv papers)
Date:
CHAIR OF THE COMMITTEE
Graduation Thesis Assignment
Name: Phan Tuan Anh
Phone: +84355538467
Email: Anh.PT211263M@sis.hust.edu.vn; phantuananhkt2204k60@gmail.com
Class: 21A-IT-KHDL-E
Affiliation: Hanoi University of Science and Technology
I, Phan Tuan Anh, hereby declare that this thesis on the topic "Improving the quality of inverse text normalization based on neural network and numerical entities recognition" is my personal work, performed under the supervision of Associate Professor Le Thanh Huong. All data used for analysis in the thesis come from my own research, analyzed objectively and honestly, with a clear origin, and have not been published in any form. I take full responsibility for any dishonesty in the information used in this study.
Student
Signature and Name
I wish that these few lines of text could help me convey my most sincere gratitude to my supervisor, Associate Professor Le Thanh Huong, who has driven and encouraged me throughout the two years of my master's course. She has listened to my ideas and given me numerous valuable pieces of advice for my proposal. Besides, she also pointed out the downsides of my thesis, which was very helpful for me in perfecting it.

I would like to thank Dr. Bui Khac Hoai Nam and the other members of the NLP team at Viettel Cyberspace Center, who always support me and provide me with foundational knowledge. Especially, my leader, Mr. Nguyen Ngoc Dung, always creates favorable conditions for me to conduct extensive experiments in this study.

Last but not least, I would like to thank my family, who play the most important role in my life. They are constantly my motivation to accept and overcome the challenges I face at this time.
Neural inverse text normalization (ITN) has recently become an emerging approach for automatic speech recognition in terms of post-processing for readability. In particular, leveraging ITN by using neural network models has achieved remarkable results instead of relying on the accuracy of manual rules. However, ITN is a highly language-dependent task and is especially tricky in ambiguous languages. In this study, we focus on improving the performance of ITN tasks by adopting a combination of neural network models and rule-based systems. Specifically, we first use a seq2seq model to detect numerical segments (e.g., cardinals, ordinals, and dates) of input sentences. Then, the detected segments are converted into the written form using rule-based systems. Technically, a major difference in our method is that we only use neural network models to detect numerical segments, which makes it possible to deal with the low-resource and ambiguous scenarios of target languages. In addition, to further improve the quality of the proposed model, we also integrate a pre-trained language model, BERT, and one variant of BERT (RecogNum-BERT) as initialization points for the parameters of the encoder.

Regarding the experiments, we evaluate two different languages, English and Vietnamese, to indicate the advantages of the proposed method. Accordingly, empirical evaluations provide promising results for our method compared with state-of-the-art models in this research field, especially in the case of low-resource and complex data scenarios.

Student
Signature and Name
TABLE OF CONTENTS

CHAPTER 1 Introduction
1.1 Research background
1.2 Research motivation
1.3 Research objective
1.4 Related publication
1.5 Thesis organization
CHAPTER 2 Literature Review
2.1 Related works
2.1.1 Rule-based methods
2.1.2 Neural network model
2.1.3 Hybrid model
2.2 Background
2.2.1 Encoder-decoder model
2.2.2 Transformer
2.2.3 BERT
CHAPTER 3 Methodology
3.1 Baseline model
3.2 Proposed framework
3.3 Data creation process
3.4 Number recognizer
3.4.1 RNN-based and vanilla transformer-based
3.4.2 BERT-based
3.4.3 RecogNum-BERT-based
3.5 Number converter
CHAPTER 4 Experiment
4.1 Datasets
4.2 Hyper-parameter configurations
4.2.1 RNN-based and vanilla transformer-based configurations
4.2.2 BERT-based and RecogNum-BERT-based configurations
4.3 Evaluation metrics
4.3.1 Bilingual evaluation understudy (BLEU)
4.3.2 Word error rate (WER)
4.3.3 Number precision (NP)
4.4 Result and Analysis
4.4.1 Experiments without pre-trained LM
4.4.2 Experiments with pre-trained LM
4.5 Visualization
CHAPTER 5 Conclusion
5.1 Summary
5.2 Future work
LIST OF FIGURES

1.1 The role of the Inverse text normalization module in spoken dialogue systems
2.1 The pipeline of the NeMo toolkit for inverse text normalization
2.2 The overview of the encoder-decoder architecture for a machine translation example (English→Vietnamese)
2.3 The overview of using an LSTM-based encoder block (left) and the architecture of the LSTM (right)
2.4 The illustration of the decoding process
2.5 The general architecture of the vanilla Transformer, which is introduced in [20]
2.6 The description of scaled dot-product attention (left) and multi-head attention (right)
2.7 The overview of the pre-training procedure of BERT, which is trained on a large corpus with next sentence prediction and masked token prediction
3.1 The overview of my baseline model (the seq2seq model for the ITN problem) [7]
3.2 The general framework of the proposed method (hybrid model) for the Neural ITN approach
3.3 The overview of the data creation pipeline for training the Number recognizer
3.4 The training and inference processes of applying BERT as the initializing encoder for the proposed model
3.5 The overview of the training and inference processes of my proposed model when integrating RecogNum-BERT
3.6 Our architecture for creating RecogNum-BERT
3.7 The pipeline of data preparation for fine-tuning RecogNum-BERT
4.1 The comparison of models on the English test set with BLEU score (higher is better)
4.2 The comparison of models on the English test set with WER score (lower is better)
4.3 The comparison of models on the English test set with NP score (higher is better)
4.4 The comparison of models on the Vietnamese test set with BLEU score (higher is better)
4.5 The comparison of models on the Vietnamese test set with WER score (lower is better)
4.6 The comparison of models on the Vietnamese test set with NP score (higher is better)
LIST OF TABLES

1.1 Examples of the ambiguous semantic problem in the Vietnamese language
3.1 An example output of the Number converter
4.1 The training size, validation size, vocabulary size, and average input sequence length of my datasets
4.2 The results of the Number recognizer on the validation set with BLEU score
4.3 Comparison between variants of the proposed method with different encoders for the Number recognizer module (Transformer base, BERT, RecogNum-BERT) in BLEU score
4.4 Comparison between variants of the proposed method with different encoders for the Number recognizer (vanilla Transformer, BERT, RecogNum-BERT) in WER score
4.5 Examples of prediction errors of the baseline model in English
4.6 Examples of prediction errors of the proposed model in Vietnamese
Notation Description
ASR Automatic Speech Recognition
BERT Bidirectional Encoder Representations from Transformers
Bi-LSTM Bidirectional Long Short-Term Memory
BLEU Bilingual Evaluation Understudy
CNN Convolutional neural network
end2end End to End
FST Finite state transducer
ITN Inverse text normalization
OOV out of vocabulary
POS Part of speech
seq2seq Sequence to Sequence
TN Text normalization
TTS Text To Speech
WER Word error rate
WFST Weighted finite state transducer
CHAPTER 1 Introduction

1.1 Research background

The text in spoken form is lowercase and does not contain any punctuation or numerical tokens. Consequently, ITN processes it to yield the text in written form, in which the numerical tokens are converted into a natural format.

Figure 1.1: The role of the Inverse text normalization module in spoken dialogue systems
Additionally, text normalization (TN) is the inverse problem of ITN, which transforms text in written form into spoken form. Despite being two opposite processes, ITN and TN have a close relationship, and many researchers use similar approaches and techniques for dealing with them. Nevertheless, unlike the promising methods explored for the TN problem in recent years [1], there have not been many remarkable achievements for the ITN problem, which is regarded as one of the most challenging NLP tasks.
The conventional approach to ITN is rule-based systems. For instance, Finite State Transducer (FST)-based models [2] have shown competitive results [3]. However, the major issue with this approach is scalability, since it requires complex and accurate transformation rules [4]. Recently, neural ITN has become an emerging topic in this research field, exploiting the power of neural networks (NN) for ITN tasks.

Furthermore, due to the significant difference between written and spoken forms, handling numbers with minimal error is a central problem in this research field.
In particular, mirroring the way humans handle numeric values, the models should work well on two consecutive tasks: recognizing the parts that belong to numeric values, and combining those parts into precise numbers.
Specifically, NN-based models, typically seq2seq, have achieved high performance and become state-of-the-art models for the ITN problem [5], [6], [7]. Nevertheless, as mentioned above, ITN is a highly language-dependent task and requires linguistic knowledge. In this regard, the data-hungry problem (i.e., low-resource scenarios) is an open issue that needs to be taken into account to improve performance. For instance, in a data-shortage situation, models might lack information during the training stage to recognize and transform numerical segments, which leads to poor ITN performance.
In this study, I investigate how to improve the performance of neural ITN in low-resource and ambiguous scenarios. Particularly, for number-formatting problems, conventional seq2seq models might fail to generate digits sequentially, character by character, which is often required for long numbers (e.g., phone numbers and big cardinals). For example, the number 'one billion and eight' must be converted into 10 sequential characters: '1 0 0 0 0 0 0 0 0 8'. Moreover, the scarcity of data in the training process can make this issue even worse when the considered languages have many ambiguous semantics between numbers and words. For instance, in Vietnamese, the word 'không' (English translation: no) can be the digit '0' but is also used to indicate a negative opinion. Tab. 1.1 illustrates several examples of the ambiguous semantic problem in the Vietnamese language.
In this thesis, the proposed framework includes two stages: i) in the first stage, I use a neural network model to detect the numerical segments in each sentence; ii) in the second stage, the output of the first stage is converted into the written form using a set of rules.
Table 1.1: Examples of the ambiguous semantic problem in the Vietnamese language

Spoken form (English translation)                                   | Number | Word
tôi không thích cái bánh này (I do not like this cake)              |        |  ✓
không là số tự nhiên nhỏ nhất (zero is the smallest natural number) |   ✓    |
năm một nghìn chín trăm chín bảy (nineteen ninety-seven)            |        |  ✓
năm mươi nghìn (fifty thousand)                                     |   ✓    |
chín quả táo (nine apples)                                          |   ✓    |
Accordingly, the main difference between my method and previous works is that I only use the neural network to detect numerical segments in each sentence in the first stage. The number reading is processed in the second stage by a set of rules, which is able to supply substantial information to the system without requiring much data to learn, as end-to-end models do. Additionally, in the first stage of my pipeline, I implement the numerical detector as a seq2seq model, in which I also investigate the efficiency of several conventional approaches: RNN-based and Transformer-based. Besides, I also take advantage of a pre-trained language model (BERT and a variant of BERT, RecogNum-BERT) to boost the performance.
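To make the two-stage idea concrete, the following is a minimal, self-contained sketch. The function names (tag_numeric_segments, convert_segments) and the tiny digit dictionary are hypothetical stand-ins: in the actual framework of Chapter 3, the first stage is a seq2seq Number recognizer and the second stage is a much richer rule-based Number converter.

```python
# Minimal sketch of the two-stage hybrid ITN pipeline (illustrative only).
SPOKEN_DIGITS = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}

def tag_numeric_segments(tokens):
    """Stage 1 (normally a seq2seq model): mark tokens belonging to a number."""
    return [tok in SPOKEN_DIGITS for tok in tokens]

def convert_segments(tokens, tags):
    """Stage 2 (rule-based converter): rewrite tagged spans, keep the rest."""
    out, i = [], 0
    while i < len(tokens):
        if tags[i]:
            digits = []
            while i < len(tokens) and tags[i]:
                digits.append(SPOKEN_DIGITS[tokens[i]])
                i += 1
            out.append("".join(digits))          # "nine seven one two" -> "9712"
        else:
            out.append(tokens[i])
            i += 1
    return " ".join(out)

tokens = "call me at nine seven one two".split()
print(convert_segments(tokens, tag_numeric_segments(tokens)))  # call me at 9712
```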
Generally, the main contributions of my method are as follows:
• I propose a novel hybrid approach that combines a neural network with a rule-based system, which is able to deal with ITN problems in low-resource and ambiguous scenarios. This is the first research to conduct experiments on and analyze this data scenario.

• I evaluate the proposed methods on two different languages, English and Vietnamese, with promising results. Specifically, my method can be extended easily to other languages without requiring any linguistic or grammatical knowledge.

• I propose a novel approach for integrating the knowledge of a pre-trained language model (BERT) to enhance the quality of my method.

• I present a novel pipeline to build an ITN model for Vietnamese that is based on a neural network.
1.2 Research motivation
Regarding the research motivation, I have two remarkable points as follows:
• The data scenario I would like to consider throughout this thesis is low-resource and ambiguous data. As I describe in detail later in the Experiment chapter, for research purposes, most previous researchers reuse the data of the TN problem [2]. Essentially, because ITN and TN are opposite problems, the authors reverse the order of each sample in the TN data set to turn it into a sample for ITN. Intuitively, I can suppose that there is no standard data set for ITN tasks. Vietnamese faces the same issue, as there is no annotated data set for ITN. Besides, building a data set for the ITN problem in any language costs considerably in workforce and time. This phenomenon may limit the data sets available in both industry and academia, and it is the biggest motivation for me to consider the low-resource data scenario. Additionally, the complexity of the data also needs to be considered. For several languages, including Vietnamese, ambiguity becomes a serious problem (as in the examples in Table 1.1). This increases the difficulty level of the data that a model has to deal with. More specifically, both conventional methods, rule-based systems and neural networks, might perform poorly when dealing with such a complex language. As a consequence, besides the low-resource scenario, I am also motivated to consider a second scenario: ambiguous data.
• As mentioned above, both approaches, a rule-based system and a neural network system, might face obstacles with low-resource and complex data, and each has its own particular downsides. Building a rule-based system is extremely complicated and requires a great deal of expert language knowledge. In addition, a rule-based system is limited in its ability to generalize, upgrade, and extend. For an ambiguous language such as Vietnamese, this problem is more serious, and in some cases the rule-based system fails. Regarding neural networks, despite having the capacity to deal with the generalization and ambiguity problems by learning contextual embeddings, this approach requires plentiful resources. Moreover, because there is no mechanism for effectively controlling the output, using an end2end neural network model also causes unrecoverable errors. The above reasons encourage us to invent a new method that combines the two approaches, harmonizing their strengths as well as eliminating their drawbacks.
1.3 Research objective

The entities that I focus on are numerical entities. Besides, as mentioned in the research motivation section, I put my method under the scenario of low-resource and ambiguous data, and my method is a combination of the two conventional approaches: rule-based and neural network systems.

Using the hybrid model, I want to prove the efficiency of my method on limited and complex data. Not only that, I also compare the performance of my model with the baseline model in the resource-rich setting to see the downside of my method when the data is rich.

Due to the effectiveness of transferring knowledge from a pre-trained language model to downstream NLP tasks, I also would like to test whether this statement holds for the ITN problem. Additionally, by inventing a novel variant of BERT, I would like to test whether supplementing additional information about the appearance of numerical entities in sentences can boost the performance or not.

Finally, via this thesis, I desire to raise the awareness of researchers about this problem as well as my scenarios. I hope the method presented in my thesis is helpful for real production.
1.4 Related publication
Phan, T. A., Nguyen, N. D., Thanh, H. L., Bui, K. H. N. (2022, December). Neural Inverse Text Normalization with Numerical Recognition for Low Resource Scenarios. In Intelligent Information and Database Systems: 14th Asian Conference, ACIIDS 2022, Ho Chi Minh City, Vietnam, November 28–30, 2022, Proceedings, Part I (pp. 582–594). Cham: Springer International Publishing. (Accepted)
1.5 Thesis organization

Chapter 2 describes the literature review. This chapter is further divided into two subparts: related work and background. In the related work, I summarize the existing work dealing with the ITN and TN problems, categorized into three main approaches. The background part provides the fundamental knowledge about the backbone of my methodology, such as the encoder-decoder architecture, the LSTM-based and Transformer-based seq2seq models, and an overview of BERT.

Chapter 3 reveals the details of my methodology. In this chapter, I introduce the overview of the baseline model, the proposed model, the pipeline for creating data for seq2seq training, and the LSTM-based, Transformer-based, BERT, and BERT-variant encoders applied to the encoder-decoder model. At the end of this chapter, I describe the way I build the rule-based system for both English and Vietnamese, which plays a crucial role in the whole system.

Chapter 4 first shows the important traits of my data and the details of the configurations that I use in my experiments in this thesis. Besides, this chapter also provides a thorough analysis and comparison between my method and the baseline.

Chapter 5 summarizes my contributions in this thesis. In this chapter, I also give some interesting ideas that I am going to investigate in the future.
CHAPTER 2 Literature Review

2.1 Related works

2.1.1 Rule-based methods

Kestrel is a component of the Google TTS synthesis system [2] that concentrates on solving the TN problem. Kestrel first categorizes the numerical entities in text into multiple semiotic classes: measure, percentage, currency amount, date, cardinal, ordinal, decimal, fraction, time, telephone number, and electronic address. Subsequently, they use a protocol buffer whose basic unit is the message. One message is essentially a dictionary, which consists of named keys and certain values. The values might include integers, strings, booleans, or even other nested messages. The whole process can be divided into two stages: classification and verbalization. While the classifier is responsible for recognizing which semiotic class the corresponding token belongs to via WFST grammars, the verbalizer receives the message and converts it into the right form. In terms of evaluation results, Kestrel achieves promising accuracy in both English and Russian. Especially, for both languages, their proposal reaches virtually absolute accuracy for several semiotic classes (cardinal, date, decimal, and electronic) and 99% on the Google TN test data set [4].

For ITN, the set of rules in [5] achieves 99% on internal data from a virtual assistant application. Recently, in order to pave a new path for seamless production development, Zhang et al. introduce an open-source Python WFST-based library [3]. The system is illustrated in Figure 2.1. Compared to Kestrel's pipeline, this flow is somewhat similar in the way the authors define the semiotic classes, except that the process is reversed. NeMo ITN also includes a two-stage normalization pipeline that first detects semiotic tokens (classification) and then converts these to written form (verbalization). Both stages consume a single WFST grammar. The major problem is that this approach requires significant effort and time to scale the system across languages. Nevertheless, by using the Python library Pynini to formulate and compile grammars, NeMo ITN can easily add new rules, modify an existing class, or add an entirely new semiotic class. This is a huge advantage for deploying it into a production environment. In terms of results, the NeMo toolkit obtains an exact match of 98.5% for CARDINAL and 78.65% for DECIMAL on the cleaned data set.

Figure 2.1: The pipeline of the NeMo toolkit for inverse text normalization.

Overall, when using grammar rules such as WFSTs, we do not need an annotated data set, but this method costs significant language-expert effort and makes the model challenging to scale.
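To illustrate the classify-then-verbalize design shared by Kestrel and NeMo ITN, here is a toy, self-contained Python sketch. It is not Kestrel's protocol buffers nor NeMo's Pynini grammars; the message dictionary, the two semiotic classes, and the naive word-summing converter are simplifying assumptions for illustration only.

```python
# Toy two-stage "classification + verbalization" pipeline for a single phrase.
SPOKEN_CARDINALS = {"one": 1, "two": 2, "three": 3, "five": 5, "twenty": 20}

def classify(phrase):
    """Stage 1: assign a semiotic class and structured fields (a 'message')."""
    tokens = phrase.split()
    if tokens and tokens[-1] == "dollars":
        return {"class": "MONEY", "amount": tokens[:-1], "currency": "$"}
    return {"class": "CARDINAL", "value": tokens}

def verbalize(message):
    """Stage 2: render the message into written form, per semiotic class."""
    def to_number(words):
        # Naive: only handles sums like "twenty five"; real grammars are richer.
        return sum(SPOKEN_CARDINALS.get(w, 0) for w in words)
    if message["class"] == "MONEY":
        return f'{message["currency"]}{to_number(message["amount"])}'
    return str(to_number(message["value"]))

print(verbalize(classify("twenty five dollars")))   # -> $25
print(verbalize(classify("twenty three")))          # -> 23
```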
2.1.2 Neural network model
Recurrent Neural Network (RNN)-based seq2seq models [8] have been adopted to reduce manual processes.
For the TN problem, Sproat et al. [9] consider this problem as a machine translation task and develop an RNN-based seq2seq model trained on window-based data. Specifically, an input sentence is regarded as a sequence of characters, and the output sentence is a sequence of words. Furthermore, because of the input sequence length problem, they split a sentence into chunks with a window size equal to three to create training samples in which normalized tokens are marked by a distinctive begin tag <norm> and end tag </norm>. In this regard, this approach is able to limit the number of input and output nodes to something reasonable. Their neural network architecture closely follows that of [10].
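As an illustration of this window-based sample construction, the sketch below wraps each token to be normalized in <norm> ... </norm> tags with three context tokens on each side. It works on word tokens for readability, whereas [9] operates on characters on the input side; the token list and normalization flags are made up for the example.

```python
# Build window-based training samples with <norm> tags (window size three).
def make_windows(tokens, to_normalize, window=3):
    samples = []
    for i, tok in enumerate(tokens):
        if not to_normalize[i]:
            continue
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        samples.append(" ".join(left + ["<norm>", tok, "</norm>"] + right))
    return samples

tokens = ["the", "meeting", "is", "on", "2023-05-15", "at", "10am", "sharp"]
to_normalize = [False, False, False, False, True, False, True, False]
for sample in make_windows(tokens, to_normalize):
    print(sample)
# meeting is on <norm> 2023-05-15 </norm> at 10am sharp
# on 2023-05-15 at <norm> 10am </norm> sharp
```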
Subsequently, Sevinj et al. [11] proposed a novel end-to-end Convolutional Neural Network (CNN) architecture with residual connections for the TN task. Particularly, they consider the TN problem as a classification problem with two stages: i) first, the input sentence is segmented into chunks, similar to [9], and a CNN-based model labels each chunk with its corresponding class based on the scores of a softmax function; ii) after the classification stage, they apply rule-based methods depending on each class. Rather than using grammar rules, leveraging the CNN model proves the efficiency of this method, with approximately 99.44% accuracy over all semiotic classes.
Mansfield et al. [12] first take advantage of an RNN-based seq2seq model for dealing with the TN problem. Especially, both input and output are processed using subword units to overcome the OOV issue; numerical tokens are even tokenized to the character level. In addition, linguistic features such as 1) capitalization (upper, lower, mixed, non-alphanumerical, foreign characters), 2) position (beginning, middle, end, singleton), 3) POS tags, and 4) labels are also taken into account for enhancing the quality of the neural machine translation model. To integrate linguistic features with subword units, the concatenate or add operator is utilized, and the combined embedding is fed into the Bi-LSTM encoder. When compared with the window-size-based method as a baseline, the full model achieves an SER (sentence error rate) of only 0.78%, a WER (word error rate) of 0.17%, and a BLEU score of approximately 99.73%.
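The following is a rough sketch of that feature-combination idea: subword embeddings are concatenated with embeddings of linguistic features and fed into a Bi-LSTM encoder. It is not the exact model of [12]; the feature set, vocabulary sizes, and dimensions are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

class FeatureAwareEncoder(nn.Module):
    """Subword embedding + linguistic-feature embeddings -> Bi-LSTM encoder."""
    def __init__(self, vocab_size=8000, n_case=5, n_pos=20,
                 d_tok=128, d_feat=16, d_hid=256):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_tok)
        self.case_emb = nn.Embedding(n_case, d_feat)   # capitalization feature
        self.pos_emb = nn.Embedding(n_pos, d_feat)     # POS-tag feature
        self.encoder = nn.LSTM(d_tok + 2 * d_feat, d_hid,
                               batch_first=True, bidirectional=True)

    def forward(self, tok_ids, case_ids, pos_ids):
        # Concatenate token and feature embeddings along the last dimension.
        x = torch.cat([self.tok_emb(tok_ids),
                       self.case_emb(case_ids),
                       self.pos_emb(pos_ids)], dim=-1)
        hidden_states, _ = self.encoder(x)             # (batch, seq, 2 * d_hid)
        return hidden_states

enc = FeatureAwareEncoder()
dummy = torch.zeros(2, 7, dtype=torch.long)
print(enc(dummy, dummy, dummy).shape)                  # torch.Size([2, 7, 512])
```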
For the ITN problem, Sunkara et al. [7] also consider ITN in the form of a machine translation task. Inspired by the subword tokenizer methods in [12], they first tokenize the sentence using the SentencePiece toolkit [13], then feed the embedding features to the encoder. The output of the decoder is recovered through several post-processing steps. Their proposed architecture uses both RNN-based and Transformer-based encoder-decoder models with a copy attention mechanism in decoding. The result comparisons show the best performance for the Transformer-based method in various domains of the test set, with only 1.1% WER for Wikipedia and 2.0%, 2.4%, and 1.5% for CNN, Daily Mail, and News-C, respectively. Additionally, motivated by the impressive outcome of using neural networks for ITN, the authors also consider using pre-trained language model information to boost the performance. They find that using pre-trained models such as BART [14] to initialize the encoder and decoder does not yield good performance, while using BERT [15] to extract contextual embeddings and fuse them into each layer of the Transformer as in [16] benefits minimally. Beyond English, their model also obtains good WER scores in German, Spanish, and Italian with 2.1%, 5%, and 1.6%, respectively.
In conclusion, methods that use a neural network model to deal with both TN and ITN tasks can relieve the model of the serious cost of language-expert information as well as complex model structures. These methods can easily be expanded to multiple languages and scaled to huge systems. Nevertheless, implementing these methods also faces several challenges:
• Training neural networks requires a large amount of labeled data. Practically, in most of the research regarding the ITN problem, to yield sample pairs for training, the authors use the Google TN data set and swap input and output to create data for ITN. This phenomenon raises concerns about the quality of the model under the low-resource scenario.
• Because seq2seq is used as an end-to-end model, the neural networks can generate unrecoverable errors. When handling semiotic classes such as numerical entities, these errors can cause severe problems in practical applications.
2.1.3 Hybrid model
Both WFSTs and neural networks for ITN have downsides of their own. WFST-based methods strongly depend on the volume and accuracy of the set of grammar rules that language experts can provide, and in some cases they are not able to cover all situations. Meanwhile, neural networks consume a great amount of annotated data and suffer from unrecoverable errors. Therefore, many existing approaches combine the two aforementioned methods to overcome their weaknesses.
Pusateri et al. [5] present a data-driven approach for ITN problems using a set of simple rules and a few hand-crafted grammars to cast ITN as a labeling problem. Then, a bidirectional LSTM model is adopted to solve the classification problem.
Sunkara et al. [7] propose a hybrid approach combining Transformer-based seq2seq models and FST-based text normalization techniques, where the output of the neural ITN model is passed through an FST. They use a confidence score emitted by the neural model to decide whether the system should use the neural ITN output. Intuitively, the confidence score can be considered as a filter that switches the overall model to choose the output of the FST rather than the neural model in case the model encounters an unrecoverable error. Basically, this approach is not truly a combination of the two approaches, because the final output is produced by one of the two models.
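The confidence-based switching just described boils down to a simple gate, sketched below. The threshold value and the two stub "models" are hypothetical; in [7] the confidence comes from the neural seq2seq model itself.

```python
# Toy confidence gate: keep the neural ITN output only when confident enough.
def hybrid_itn(sentence, neural_itn, fst_itn, threshold=0.9):
    written, confidence = neural_itn(sentence)
    return written if confidence >= threshold else fst_itn(sentence)

neural_stub = lambda s: ("call me at 9712", 0.42)   # (output, confidence)
fst_stub = lambda s: "call me at 9712"
print(hybrid_itn("call me at nine seven one two", neural_stub, fst_stub))
```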
2.2 Background
In this section, I briefly introduce several crucial pieces of knowledge related to this work. They include the encoder-decoder model, the Transformer architecture, and the pre-trained language model BERT. All of them are significant units that constitute my proposed model.
2.2.1 Encoder-decoder model
The encoder-decoder model was first presented in [8] by Sutskever et al., aiming to solve the seq2seq problem. The seq2seq problem generally refers to an NLP task in which both input and output are sequences of tokens, such as machine translation, abstractive summarization, text generation, and text normalization.

Intuitively, an encoder-decoder model is made of two main components: one encoder and one decoder. Each component is further constituted of smaller units, called blocks: encoder blocks and decoder blocks. Each encoder block receives the output of the previous block as input, tries to capture information to create the hidden state, and forwards it to the following block. The hidden state can be understood as an encoding of the information in vector space. Consecutively, the encoded information of the encoder blocks is passed to the decoder blocks to decode the list of tokens sequentially. In each decoding step, the decoder block must use information from both the previous block and the encoder block to predict the next token until a particular condition is met. The final result is the combination of all predicted tokens. Figure 2.2 shows the overview of the encoder-decoder architecture for a machine translation problem (English → Vietnamese), in which the source sentence in English is passed through the encoder, which is constructed from multiple stacked encoder blocks. The output of a lower block is used as the input of the next one. Finally, the encoder tries to obtain the context vector by capturing the intra-relations between the elements of the source sentence. Sequentially, the context vector is fed into the decoder, which creates the final result as the target sentence in Vietnamese.
In the next part of this section, I review in detail the application of the Long Short-Term Memory (LSTM) network to the encoder-decoder model, as presented in [8].
Figure 2.2: The overview of the encoder-decoder architecture for a machine translation example (English→Vietnamese).
Encoder: The LSTM, introduced in [17], is a recurrent neural network architecture. By introducing three new types of gates (the input gate regulates the amount of incoming information, the forget gate decides how much information will be discarded at the current step, and the output gate controls which part of the current state is output), the LSTM is able to deal with the vanishing gradient problem of the conventional RNN model and effectively capture information from time-series data. With the LSTM as the encoder block, I denote $s = (s_i)_{i=1}^{N}$ as the input string, where $s_i$ indicates the $i$-th token. In the $l$-th encoder block, $x_i$ is the embedding of the $i$-th token, i.e., a vector from a learnable embedding matrix, and $h_i$ is the hidden state of the model at time step $i$. The value of $h_i$ can be computed based on $h_{i-1}$ and $x_i$ as follows:
$\mathrm{forget}_i = \sigma_g(W_{\mathrm{forget}} x_i + U_{\mathrm{forget}} h_{i-1} + b_{\mathrm{forget}})$  (2.1)

$\mathrm{input}_i = \sigma_g(W_{\mathrm{input}} x_i + U_{\mathrm{input}} h_{i-1} + b_{\mathrm{input}})$  (2.2)

$\mathrm{output}_i = \sigma_g(W_{\mathrm{output}} x_i + U_{\mathrm{output}} h_{i-1} + b_{\mathrm{output}})$  (2.3)

$\tilde{c}_i = \sigma_h(W_c x_i + U_c h_{i-1} + b_c)$  (2.4)

$c_i = \mathrm{forget}_i \odot c_{i-1} + \mathrm{input}_i \odot \tilde{c}_i$  (2.5)

$h_i = \mathrm{output}_i \odot \sigma_h(c_i)$  (2.6)

where $\mathrm{input}_i$, $\mathrm{output}_i$, and $\mathrm{forget}_i$ determine how much information will be used and diminished, respectively; $\sigma_g$ denotes the sigmoid function and $\sigma_h$ indicates the tanh function. Now, I have $h_i$ as the output of the $i$-th token at that time step. For the following layer $(l+1)$, $h_i^{l}$ becomes the new input embedding of the $i$-th token, $x_i^{l+1} = h_i^{l}$, and a similar process is repeated until reaching the last block. Finally, with an encoder containing $k$ blocks, the final hidden states are $h^{k} = (h_1^{k}, h_2^{k}, \ldots, h_N^{k})$. These vectors are also called context vectors.
The whole encoding process aims to extract valuable information based on the characteristics of each token $x_i$ and the knowledge about the structure of the input: its sequentiality. Practically, instead of only using a one-directional LSTM as above, researchers usually take advantage of both directions of the sequence by using a bidirectional LSTM for the encoder. Essentially, the bidirectional LSTM (Bi-LSTM) has the same architecture as the LSTM, except that information is combined from the forward direction (from left to right) and the backward direction (from right to left). The combination is performed by a concatenation operation:

$h_i^{\mathrm{Bi\text{-}LSTM}} = \big[\,\overrightarrow{h_i} \,;\, \overleftarrow{h_i}\,\big]$

The overview of the encoder with multiple blocks and the integration of the LSTM model for building the encoder is given in Figure 2.3.

Figure 2.3: The overview of using an LSTM-based encoder block (left) and the architecture of the LSTM (right).
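As a worked illustration of Eqs. (2.1)–(2.6), the snippet below performs a single LSTM step with numpy. The shapes and random parameters are arbitrary; this is a didactic sketch rather than the training code used in this thesis.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_i, h_prev, c_prev, params):
    """One LSTM time step following Eqs. (2.1)-(2.6)."""
    W, U, b = params["W"], params["U"], params["b"]      # dicts keyed by gate
    f = sigmoid(W["forget"] @ x_i + U["forget"] @ h_prev + b["forget"])   # (2.1)
    i = sigmoid(W["input"] @ x_i + U["input"] @ h_prev + b["input"])      # (2.2)
    o = sigmoid(W["output"] @ x_i + U["output"] @ h_prev + b["output"])   # (2.3)
    c_tilde = np.tanh(W["cell"] @ x_i + U["cell"] @ h_prev + b["cell"])   # (2.4)
    c_i = f * c_prev + i * c_tilde                                        # (2.5)
    h_i = o * np.tanh(c_i)                                                # (2.6)
    return h_i, c_i

d, dh = 4, 3
rng = np.random.default_rng(0)
gates = ("forget", "input", "output", "cell")
params = {
    "W": {g: rng.normal(size=(dh, d)) for g in gates},
    "U": {g: rng.normal(size=(dh, dh)) for g in gates},
    "b": {g: np.zeros(dh) for g in gates},
}
h, c = lstm_step(rng.normal(size=d), np.zeros(dh), np.zeros(dh), params)
print(h.shape, c.shape)   # (3,) (3,)
```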
Decoder: Opposite to the functionality of the encoder, the decoder aims to transform the context vectors from the vector space into the output. Likewise to the architecture of the encoder block, the decoder block is also easily implemented using an LSTM. The Bi-LSTM is not considered here because the decoding process has only one direction: from left to right. In the work in [8], the decoder only takes the last context vector $h_N^{k}$ as the initialization of the hidden state of the first step, rather than the zero vector as in the encoder. With a decoder containing $K$ blocks, the decoding process is performed sequentially as follows:
• At time step $j$ (to avoid confusion, I denote $h_N^{k} = h_{\mathrm{encode}}$), the hidden state of the $k$-th layer is computed from $h_{j-1}$ and the input from layer $(k-1)$, $x_j^{k-1}$, following the LSTM equations above.
The decoding continues until the predicted token is the special token END or the sequence length limit is reached. The overview of using the LSTM for the decoder is illustrated in Figure 2.4.
Figure 2.4: The illustration of the decoding process
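The step-by-step decoding described above amounts to a simple loop; here is a schematic greedy version. The decoder_step callable is a stand-in for the real decoder block(s), and the toy schedule exists only to make the example runnable.

```python
def greedy_decode(context_vector, decoder_step,
                  start_token="<BOS>", end_token="<END>", max_len=50):
    """Feed predictions back step by step until END or the length limit."""
    tokens = [start_token]
    hidden = context_vector              # h_N^k initializes the decoder state
    while len(tokens) < max_len:
        next_token, hidden = decoder_step(tokens[-1], hidden)
        if next_token == end_token:
            break
        tokens.append(next_token)
    return tokens[1:]                    # drop the start token

def toy_decoder_step(prev_token, hidden):
    # Stand-in for a real LSTM/Transformer decoder block.
    schedule = {"<BOS>": "1", "1": "9", "9": "<END>"}
    return schedule[prev_token], hidden

print(greedy_decode(None, toy_decoder_step))   # ['1', '9']
```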
To enhance the quality of the decoding process, [18] and [19] introduce different ways of applying attention mechanisms to improve the alignment ability of the model. Thanks to this mechanism, the decoder is able to decide which tokens on the encoder side contribute more to predicting the next token.

2.2.2 Transformer
Recently, the Transformer, a new seq2seq architecture, has achieved high performance on most NLP tasks [20]. Accordingly, the work in [7] has shown the advantage of the Transformer compared to RNN-based models in the ITN problem. Fig. 2.5 depicts the general architecture of the Transformer. Particularly, the Transformer reconstructs the encoder-decoder architecture with stacked self-attention and point-wise fully connected layers in both the encoder and the decoder.

Figure 2.5: The general architecture of the vanilla Transformer, which is introduced in [20].

Encoder: The encoder consists of 6 stacked encoder layers. The basic unit of each layer is a sub-layer or sub-block. The first sub-block is multi-head self-attention; the other is a 2-layer feed-forward network. Each sub-layer also employs a residual connection and a layer normalization. The dimension of the embedding vector is set as $d_{\mathrm{model}} = 512$.

Decoder: The decoder is also composed of 6 stacked decoder layers. In each decoder layer, the authors add an intermediate sub-layer that allows the decoder to perform attention over the output of the encoder. Likewise to the encoder, the first and last sub-layers are still multi-head self-attention and a point-wise feed-forward layer. A masking mechanism is used to prevent wrong attention to subsequent positions on the decoder side and to the positions of padding tokens on the encoder side.
The biggest difference, and also the strength, of the Transformer compared to the RNN-based encoder-decoder is multi-head attention. Particularly, by introducing three types of matrices (Query, Key, and Value) and scaled dot-product attention, this layer outputs the new hidden state of each token as the weighted sum of all considered tokens. In the case of self-attention, a token is able to decide the level of relevance between itself and the other tokens around it. In the other case, using the cross-attention mechanism, a token in the decoding process decides which tokens on the encoder side are most relevant to it. The illustrations of scaled dot-product attention and multi-head attention are given in Figure 2.6.
Figure 2.6: The description of scaled dot-product attention (left) and multi-head attention (right).

The detailed formulation of the scaled dot-product attention layer at head $i$ can be written as:

$x'_i = \mathrm{Attention}(x_i, W_i^Q, W_i^K, W_i^V) = \mathrm{softmax}\left(\dfrac{QK^{T}}{\sqrt{d_k}}\right)V$

where $x_i$ and $x'_i$ are the respective input and output of the attention layer; $W_i^Q \in \mathbb{R}^{d \times d_k}$, $W_i^K \in \mathbb{R}^{d \times d_k}$, and $W_i^V \in \mathbb{R}^{d \times d_k}$ are learnable parameters; $Q = xW_i^Q$, $K = xW_i^K$, and $V = xW_i^V$ are the query matrix, key matrix, and value matrix of head $i$, respectively; and $d$ and $d_k$ denote the hidden size of the model and of head $i$. The scale factor $\frac{1}{\sqrt{d_k}}$ is introduced to prevent the model from suffering too large or too small gradients.

Using multi-head attention benefits the model by jointly attending to multiple types of information at different positions:

$x' = \mathrm{MultiHead}(x, W^Q, W^K, W^V) = \mathrm{Concat}(\mathrm{head}_1, \mathrm{head}_2, \ldots, \mathrm{head}_k)\,W^O$  (2.14)

where $\mathrm{head}_i = \mathrm{Attention}(x_i, W_i^Q, W_i^K, W_i^V)$  (2.15)
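The following numpy sketch mirrors the scaled dot-product attention formula and Eqs. (2.14)–(2.15): one head computes softmax(QK^T / sqrt(d_k)) V, and several heads are concatenated and projected. Dimensions and random weights are arbitrary illustration choices, not values used in this thesis.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(x, W_q, W_k, W_v):
    """Scaled dot-product attention for a single head."""
    Q, K, V = x @ W_q, x @ W_k, x @ W_v            # each (seq_len, d_k)
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))      # (seq_len, seq_len)
    return weights @ V                             # weighted sum of values

def multi_head(x, heads, W_o):
    """Concatenate the heads' outputs and project back to the model size."""
    return np.concatenate([attention(x, *h) for h in heads], axis=-1) @ W_o

rng = np.random.default_rng(0)
seq_len, d, d_k, n_heads = 5, 16, 4, 4
x = rng.normal(size=(seq_len, d))
heads = [tuple(rng.normal(size=(d, d_k)) for _ in range(3)) for _ in range(n_heads)]
W_o = rng.normal(size=(n_heads * d_k, d))
print(multi_head(x, heads, W_o).shape)             # (5, 16)
```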
Sequentially, the log conditional probability of the output sequence can be interpreted as:

$\log p(y \mid x) = \sum_{t=1}^{T} \log p(y_t \mid y_{<t}, x)$

2.2.3 BERT
BERT stands for Bidirectional Encoder Representations from Transformers. BERT is designed with the architecture of the Transformer encoder, is pre-trained on unannotated data, and has the ability to combine both left and right contexts in all layers. As with other pre-trained language models, BERT was created to be applied to downstream NLP tasks. BERT learns the contextual embedding of each token based on two mechanisms: masked token prediction and next sentence prediction. The overview of the training process for BERT is given in Figure 2.7. To fulfill the two tasks simultaneously, BERT is required to investigate the contextual embeddings of all tokens of the input sentence. Due to being trained on a large corpus, BERT is able to collect huge knowledge of a particular language and is very useful for enhancing the performance of downstream NLP tasks.
Figure 2.7: The overview of the pre-training procedure of BERT, which is trained on a large corpus with next sentence prediction and masked token prediction.
Here, I only focus on analyzing masked token prediction. In the training process, a certain proportion of tokens in the original input are masked. In the final layer, the hidden states of the tokens at the corresponding positions are passed