
ADVANCES IN PUNCTUATION AND DISFLUENCY PREDICTION

WANG XUANCONG

B.Sc. (Hons.), NUS

A THESIS SUBMITTED FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

NUS GRADUATE SCHOOL FOR INTEGRATIVE

SCIENCES AND ENGINEERING

NATIONAL UNIVERSITY OF SINGAPORE

2015


I hereby declare that this thesis is my original work and it has been written by me in its entirety. I have duly acknowledged all the sources of information which have been used in the thesis.

This thesis has also not been submitted for any degree in any university previously.

Wang Xuancong

23 January 2015


My PhD journey is a life journey during which I have not only acquired knowledge in the field of speech and natural language processing, but also learned various techniques in doing research, how to collaborate with other people, and how to analyze problems and come up with effective solutions. Now at the end of this journey, it is time to acknowledge all those who have contributed to it.

First and foremost, I would like to thank my main supervisor Prof. Ng Hwee Tou and my co-supervisor Prof. Sim Khe Chai. I began my initial research in speech processing under Prof. Sim Khe Chai. As a physics undergraduate, I lacked various techniques in doing computer science research. Prof. Sim was very patient and helpful in teaching me those basic experimental skills in addition to knowledge in speech processing. Later, my research focus shifted to natural language processing (NLP) because I realized that there was a gap between speech recognition and natural language processing when we talked about real-life applications, and some intermediate processing was indispensable for downstream NLP tasks. Prof. Ng, with his many years of experience in the NLP field, has helped me tremendously in coming up with useful ideas and tackling difficult problems. Under their teaching and supervision, I have acquired knowledge in both the speech and NLP fields. They have also spent much time providing me invaluable guidance and assistance in the writing of my papers and thesis. Discussions with them have been very pleasant and helpful in improving my scientific skills.

Next, I would like to thank the other member of my thesis advisory committee, Prof. Wang Ye. His guidance and feedback during the time of my candidature has always been helpful and encouraging.


I would also like to thank my friends, schoolmates and colleagues in the NUS Graduate School for Integrative Sciences and Engineering and the NUS School of Computing for their support, helpful discussions, and fellowship.

Finally, I would like to thank my parents for their continued emotional care and spiritual support, especially when I encountered difficulties or failures.


Contents

1 Introduction
1.1 Why do we need to predict punctuation?
1.2 Why do we need to predict disfluency?
1.3 Contributions of this Thesis
1.3.1 Dynamic Conditional Random Fields for Joint Sentence Boundary and Punctuation Prediction
1.3.2 A Beam-Search Decoder for Disfluency Detection
1.3.3 Combining Punctuation and Disfluency Prediction
1.4 Organization of the Thesis
2 Related Work
2.1 Sentence Boundary and Punctuation Prediction
2.2 Disfluency Prediction
2.3 Joint Learning and Joint Label Prediction
2.4 Model Combination using Beam-Search Decoders
3 Machine Learning Models
3.1 Conditional Random Fields
3.2 Max-margin Markov Networks (M3N)
3.3 Graphical Model Extension
3.4 The Relationship between Model Complexity and Clique Order
4 Dynamic Conditional Random Fields for Joint Sentence Boundary and Punctuation Prediction
4.1 Introduction
4.2 Model Description
4.3 Feature Extraction
4.3.1 Lexical Features
4.3.2 Prosodic Features
4.3.3 Normalized N-gram Language Model Scores
4.4 Experiments
4.4.1 Data Preparation
4.4.2 Incremental Local Training
4.4.3 Vocabulary Pruning
4.4.4 Experimental Results
4.4.5 Comparison to a Two-Stage LCRF+LCRF
4.4.6 Results on the Switchboard Corpus
4.5 Conclusion
5 A Beam-Search Decoder for Disfluency Detection
5.1 Introduction
5.2 The Improved Baseline System
5.2.1 Node-Weighted and Label-Weighted Max-Margin Markov Networks (M3N)
5.2.2 Features
5.3 The Beam-Search Decoder Framework
5.3.1 Motivation
5.3.2 General Framework
5.3.3 Hypothesis Producers
5.3.4 Hypothesis Evaluators
5.3.5 Integrating M3N into the Decoder Framework
5.3.6 POS-Class Specific Expert Models
5.4 Experiments
5.4.1 Experimental Setup
5.4.2 Results
5.4.3 Discussion
5.5 Conclusion
6 Combining Punctuation and Disfluency Prediction: An Empirical Study
6.1 Introduction
6.2 The Baseline System
6.2.1 Experimental Setup
6.2.2 Features
6.2.3 Evaluation and Results
6.3 The Cascade Approach
6.3.1 Hard Cascade
6.3.2 Soft Cascade
6.3.3 Experimental Results
6.4 The Rescoring Approach
6.5 The Joint Approach
6.6 Discussion
6.7 Conclusion


Abstract

With the advancement of automatic speech recognition (ASR) technology, more and more natural language processing (NLP) applications have been used in our daily life, for example, spoken language translation, automatic question answering, and speech information retrieval. When dealing with recognized spontaneous speech, several natural problems arise. Firstly, recognized speech does not have punctuation or sentence boundary information. Secondly, spontaneous speech contains disfluency, which carries no useful content information. The lack of punctuation and sentence boundary information and the presence of disfluency affect the performance of downstream NLP tasks.

Thus, the goal of this work is to develop or improve algorithms to automatically detect sentence boundaries, add punctuation, and identify disfluent words in recognized speech so as to improve the performance of downstream NLP tasks. Specifically, we focus on punctuation prediction and disfluency prediction. For punctuation prediction, we propose using dynamic conditional random fields for joint sentence boundary and punctuation prediction. We have also investigated several model optimization techniques which are important for practical applications. For disfluency prediction, we propose a beam-search decoder approach. Our decoder can combine generative models like n-gram language models (LM) and discriminative models like Max-margin Markov Networks (M3N). Lastly, we have performed an empirical study on various state-of-the-art methods for combining the two tasks, and we have highlighted some insights in balancing the trade-off between performance and efficiency for building practical systems.


List of Tables

4.1 Comparison of punctuation prediction F1 measures (in %) for different algorithms and features for punctuation prediction
4.2 Comparison of F1 measures (in %) on sentence boundary detection (Stage 1)
4.3 Comparison of F1 measures (in %) on punctuation prediction (Stage 2) (using predicted sentence boundaries from Stage 1 for LCRF and MaxEnt)
4.4 Comparison between DCRF and LCRF on the Switchboard corpus
5.1 Feature templates for filler word prediction
5.2 Feature templates for edit word prediction
5.3 Baseline edit F1 scores for different POS tags
5.4 POS classes for expert M3N models and their baseline F1 scores
5.5 Weighted Hamming loss v(ỹt, ȳt) for M3N for both stages
5.6 Edit detection F1 scores (%) of expert models on all words belonging to that POS class in the test set (expert-M3N column), and baseline model on all words belonging to that POS class in the test set (baseline-M3N column)
5.7 Degradation of the overall performance by expert models compared to the baseline model
5.8 Performance of the beam-search decoder with different combinations of components
5.9 An example showing the effect of measuring the quality of the cleaned-up sentence
6.1 Corpus statistics for all the experiments. *: each conversation produces two long/sentence-joined sequences, one from each speaker
6.2 Labels for punctuation prediction and disfluency prediction
6.3 Feature templates for disfluency prediction, or punctuation prediction, or joint prediction, for all the experiments in this chapter
6.4 Baseline results showing the degradation by joining utterances into long sentences, removing precision/recall balancing, and reducing the clique order of features. All models are trained using M3N
6.5 Performance comparison between the hard cascade method and the soft cascade method with respect to the baseline isolated prediction. All models are trained using M3N without balancing precision and recall
6.6 Performance comparison between the rescoring method and the soft-cascade method with respect to the baseline isolated prediction. The rescoring is done on 2n² hypotheses. All models are trained using M3N without balancing precision and recall. Figures in the bracket are the oracle F1 scores of the 2n² hypotheses. *: on the development set, the best overall result is obtained at n = 10
6.7 Performance comparison among 2-layer FCRF, mixed-label LCRF and cross-product LCRF, with respect to the soft-cascade and the isolated prediction baseline. All models are trained using GRMM, with reduced clique orders


List of Figures

3.1 Graphical structure of a linear-chain CRF of length T (three different ways of drawing)
3.2 Graphical structure of two dynamic CRFs of length T
3.3 Graphical structure of a 2-layer factorial CRF of length T
3.4 Linear-chain CRF with a 3rd-order clique
3.5 An illustration of the number of weights for every observation feature on each clique of a dynamic CRF
4.1 A graphical representation of the three basic undirected graphical models. yi denotes the 1st layer label, zi denotes the 2nd layer label, and xi denotes the observation sequence
4.2 An example showing the two layers of factorial CRF labels for a sentence in the TDT3 English corpus
4.3 Punctuation statistics and distribution of the number of words in an utterance in the preprocessed TDT3 corpus
4.4 The effect of vocabulary pruning and feature pruning. The x-axis represents the value x such that the proportion of the features/vocabulary remaining after pruning is 2^(−x)
6.1 Illustration of the rescoring pipeline framework using the four M3N models used in the soft-cascade method: P(PU|x), P(DF|PU, x), P(DF|x) and P(PU|DF, x)
6.2 Illustration using (a) mixed-label LCRF; (b) cross-product LCRF; and (c) 2-layer FCRF, for joint punctuation (PU) and disfluency (DF) prediction. Unshaded nodes are observations and shaded nodes are variables to be predicted
7.1 An overall picture showing the relationship between different machine learning models used in this thesis and their evolution over time


Chapter 1

Introduction

When we look at the advancement of human civilization, language plays a very important role. It is through the use of language that knowledge is spread. It is also through the use of language that people can communicate with one another. Language can be expressed in the form of speech (spoken language) or written text (written language). After the invention of computers, especially with the popularization of personal computers, the interaction between human and computer has become more and more frequent. In the early days, the primary means of interaction with computers was through electro-mechanical devices like keyboards and mice. With the advancement of computer science, especially machine learning theory, people started to develop computer algorithms that can recognize human speech. Automatic speech recognition (ASR) technology has advanced significantly in recent decades. Probably, some day in the future, speech will become the dominant way of interacting with computers and mobile devices, because it is the natural way human beings interact with one another. In fact, many natural language processing (NLP) applications have already been adopted in our daily life. For example, iPhone’s Siri can now recognize and interpret human speech queries and execute user commands. Automatic spoken language translators (e.g., Google Translate) can recognize speech in a source language, translate it into a target language, and synthesize speech in the target language. There are also many other spoken language processing tasks which are still undergoing research, e.g., automatic creation of meeting minutes, telephone speech tracking, etc. In a typical application framework, the ASR system is used to convert speech into text for downstream NLP systems to process.

One problem when dealing with ASR output is that it does not have punctuation, nor complete sentence boundary information (Ostendorf et al., 2008). Punctuation is a set of symbols used to divide text into sentences, clauses, etc., for the disambiguation of meaning. Punctuation symbols occur only in written language and are not pronounced in spoken language. Thus, conventional ASR systems do not output punctuation symbols because they only model audible speech sounds. However, ASR systems are able to detect silence duration and use that information to predict sentence boundaries. Typically, if the silence duration is longer than some pre-set threshold, a sentence boundary is set. Therefore, if the speaker pauses too long in the middle of a sentence or pauses too briefly between two sentences, the ASR system might not be able to predict the sentence boundaries accurately.

The other problem in processing human speech is that spontaneous speech often contains disfluency. One study (Shriberg, 1999) shows that about 5–10% of natural conversations are disfluent. The proportion not only varies from person to person and from country to country, it also depends on the circumstance in which the speech is made. For example, everyday telephone conversations usually contain more disfluency than news reports. The presence of disfluency affects ASR performance significantly; the influence is more prominent in spontaneous speech as compared to read speech. Moreover, the presence of disfluency also confuses downstream NLP applications, since most systems are trained using fluent text.
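As a minimal sketch of the silence-threshold heuristic just described, the following code segments a sequence of time-stamped words into sentence-like units. The 0.5-second threshold and the word/timing format are illustrative assumptions, not values taken from this thesis.

```python
# Illustrative sketch of threshold-based sentence segmentation.
# Each word is (token, start_time, end_time) in seconds; the 0.5 s
# threshold is a made-up value, not one used in this thesis.

def segment_by_pause(words, pause_threshold=0.5):
    segments, current = [], []
    for i, (token, start, end) in enumerate(words):
        current.append(token)
        # Pause = gap between this word's end and the next word's start.
        if i + 1 < len(words):
            pause = words[i + 1][1] - end
            if pause > pause_threshold:
                segments.append(current)
                current = []
    if current:
        segments.append(current)
    return segments

words = [("okay", 0.0, 0.3), ("thanks", 0.4, 0.8),   # 1.2 s pause follows
         ("see", 2.0, 2.2), ("you", 2.3, 2.5)]
print(segment_by_pause(words))  # [['okay', 'thanks'], ['see', 'you']]
```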

In the rest of this chapter, we will first introduce the use of punctuation in written text and the presence of disfluency in spontaneous speech. After that, we will give a brief summary of the three main contributions of this thesis, followed by an outline of the thesis.

In the literature, the technical term for spotting disfluent word tokens in a text is called “disfluency detection”, because the disfluent word tokens are already present in the text for us to identify. However, the technical term for inserting punctuation symbols into unpunctuated text is called “punctuation prediction”, because punctuation is not present in the original text, so the algorithm needs to find possible locations and insert an appropriate punctuation symbol at each location.

In this thesis, since we have treated both tasks as label prediction tasks, we will refer to both problems as prediction tasks for simplicity. We may also use the terms “disfluency detection” and “disfluency prediction” interchangeably in some sections, i.e., the term “disfluency prediction” in this thesis refers to “disfluency detection” in the literature.

1.1 Why do we need to predict punctuation?

Punctuation is a very important constituent in written language. It is a product of language evolution: not all languages have contained punctuation since the beginning of time. For example, punctuation was not used in Japanese and Korean writing until the late 19th and early 20th centuries. Moreover, the punctuation used in ancient Chinese is very different from that used now. In fact, most ancient inscriptions do not contain punctuation.

The reason why humans introduced punctuation into written language is that, without punctuation, the meaning of a sequence of words can often be ambiguous. This kind of ambiguity can occur both inside a sentence (intra-sentence) and across sentences (inter-sentence). At the intra-sentence level, for example, consider the following two sentences:

“Woman, without her man, is nothing.”

“Woman: without her, man is nothing.”

The first sentence is essentially saying “woman is nothing”, and it emphasizes the importance of man; while the second sentence essentially says that “man is nothing”, and it emphasizes the importance of woman (example adopted from Wikipedia). This ambiguity arises because the word ‘her’ can be either a pronoun for third person singular or a possessive determiner for belonging to a female entity. Moreover, this kind of ambiguity can also occur at the inter-sentence level. For example,

“John fell sick. In the hospital, there was another man.”

“John fell sick in the hospital. There was another man.”

Without punctuation, we are not sure whether there was another man in the hospital or John fell sick in the hospital. This ambiguity arises because the adverbial phrase “in the hospital” can either post-modify the previous sentence or pre-modify the next sentence. As such, there is uncertainty in the position of the sentence boundary. Moreover, sometimes whether a sentence boundary is present can also lead to some ambiguity. For example,

“I don’t know why.”

“I don’t know. Why?”

In the first case, the speaker is saying in one sentence that he/she does not know the reason. However, the second case splits the word sequence into two sentences: in the first sentence, the speaker declares that he/she does not know, and in the second sentence, he/she is asking for the reason. Without knowledge of the context or without listening to the actual speech, it is very difficult to determine which case is more appropriate, because both are grammatically correct.

Interestingly, this kind of structural ambiguity can be partially resolved in speech by increasing the pause duration after those words followed by punctuation symbols. This is one reason why, without punctuation symbols, the raw text contains less information than the corresponding speech. In addition to this, we can speak a statement-like sentence in a rising tone to turn it into a question. Similarly, in text form, we can put a question mark at the end of a statement-like sentence to denote that it is a question, e.g., “you are sure about that?” Furthermore, a sentence can also be spoken in a more emphatic form to express emotion. Such sentences are called exclamatory sentences, and we denote them by ending the sentence with an exclamation mark. Features such as pause duration and rising/falling tone are also called prosodic features or acoustic features because they describe the characteristics of speech sound. From these two examples, we can also see that both prosody and punctuation introduce additional information apart from the raw sequence of words.
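To make the notion of prosodic features concrete, the sketch below derives two such features, the pause duration after each word and a crude pitch slope over it, from word-aligned timing and F0 values. The input format and the slope computation are illustrative assumptions, not the feature set used later in this thesis.

```python
# Illustrative prosodic feature extraction (not the thesis feature set).
# Each word carries its time span and a list of pitch (F0) samples in Hz.

def prosodic_features(words):
    feats = []
    for i, w in enumerate(words):
        # Pause after this word: gap to the next word's start (0 at the end).
        pause = words[i + 1]["start"] - w["end"] if i + 1 < len(words) else 0.0
        # Crude pitch slope: last minus first F0 sample over the word duration.
        f0 = w["f0"]
        pitch_slope = (f0[-1] - f0[0]) / max(w["end"] - w["start"], 1e-6)
        feats.append({"word": w["word"], "pause": pause, "pitch_slope": pitch_slope})
    return feats

words = [
    {"word": "sure",  "start": 0.0, "end": 0.4, "f0": [180.0, 190.0]},
    {"word": "about", "start": 0.5, "end": 0.8, "f0": [185.0, 182.0]},
    {"word": "that",  "start": 0.9, "end": 1.3, "f0": [170.0, 230.0]},  # rising tone
]
for f in prosodic_features(words):
    print(f)
```

The sharply positive pitch slope on the final word is the kind of cue that signals a question mark rather than a period.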


From the above analysis, punctuation has two main purposes: firstly, it breaks up a sequence of words into smaller linguistic units to establish a hierarchical structure, which can reduce ambiguity and make the text easier to read; secondly, it indicates the purpose of the sentence. Therefore, by predicting punctuation in a text, we can recover structural information in the original text and reduce ambiguity in its meaning, which can aid parsing and semantic analysis. We can also infer the sentence type, e.g., whether it is a question or a statement, which can be useful for machine translation, because given the same sequence of words, translating it as a question can be very different from translating it as a statement.

1.2 Why do we need to predict disfluency?

Disfluency is an artefact of spoken language. It only occurs in speech, but not in written text. Disfluent speech may contain breaks, irregularities, and other non-lexical vocables. Disfluency can result from a few factors. It could be that the speaker has made a mistake in speech and wants to make a correction. It could also be that people pause for a short moment to think about what they should say next. Moreover, some people have the habit of inserting words/phrases such as “uh-huh”, “I mean”, “you know”, etc., every so often while they speak.

Not all types of disfluencies will show up in ASR outputs. For example, if a speaker pauses to think for a moment without making any voiced sounds, and resumes his speech without speaking any words incorrectly, then provided that the speech is transcribed correctly, the ASR output will not contain disfluency. Sometimes, the speaker might have spoken an incomplete word and then aborted that word. Since conventional ASR systems do not output partial words, the ASR system might output either no words, a word different from the intended word, or the intended word. Partial word detection and elimination are considered disfluency processing at the sub-word level. They are usually handled by the speech recognizer. In the NLP literature for disfluency prediction, people mainly focus on word-level disfluency that is reflected in the text form.

At the text level, there are mainly two types of disfluencies: filler words and edit words. Filler words include filled pauses (e.g., ‘uh’, ‘um’) and discourse markers (e.g., “I mean”, “you know”). They are insertions in spontaneous speech to indicate pauses or mark boundaries in discourse. Edit words are words that are spoken wrongly and then corrected by the speaker. For example, consider the utterance:

“I want a flight to Boston uh I mean to Denver”

The wrongly spoken words “to Boston” form the reparandum, and the corrected words “to Denver” form the repair; the words “uh I mean” are called fillers. They are inserted to give the speaker some time to think about the correct destination and to give listeners a cue that he is making a correction afterwards. More complex disfluencies can be reduced to this simple scheme. For example, the speaker can restart a sentence or abort a sentence. In that case, the entire incomplete sentence is the reparandum.
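This reparandum/filler/repair structure can be written down as a per-token labeling task, which is how the prediction models in this thesis operate. Below is a sketch that tags the example utterance with E (edit word), F (filler), and O (fluent) labels and recovers the cleaned-up utterance; the label names and hand-assigned tags are illustrative, not the exact label set defined later.

```python
# Illustrative per-token disfluency labels: E = edit word (reparandum),
# F = filler, O = fluent. Label names are for illustration only.

utterance = "I want a flight to Boston uh I mean to Denver".split()
labels    = ["O", "O", "O", "O", "E", "E", "F", "F", "F", "O", "O"]

def clean_up(tokens, labels):
    """Remove edit and filler words to recover the fluent utterance."""
    return [t for t, l in zip(tokens, labels) if l == "O"]

print(list(zip(utterance, labels)))
print(" ".join(clean_up(utterance, labels)))  # I want a flight to Denver
```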

Disfluency varies significantly from person to person, and it also depends on the occasion of the speech. Some people are linguistically very talented; they can speak very fluently, and their speech naturally contains very few disfluencies. On some occasions such as news broadcasts or public conferences, the speech is well-prepared and expected to sound professional, and as such it usually contains very few disfluencies. However, for most people, including non-native speakers in their everyday life, when they speak to someone or talk on the phone, their speech will contain disfluency to various extents.

Disfluency usually does not carry any useful content information, unless one is examining someone’s speech skills, or during a criminal investigation where unusual speech acts like stuttering could suggest that the person is lying. However, disfluency may cause confusion to practical NLP tasks such as information extraction, language understanding, or machine translation. For example, for an online ticket-booking system which recognizes human voice, a person might make a mistake in speaking the time, the date, or the destination in various ways, and then make a correction. This might cause confusion to the booking system if it does not handle disfluency properly. For machine translation, the disfluency behavior in one language could be different in another language. For example, the disfluent English sentence, “I I I don’t know”, can be translated into a disfluent Chinese sentence by preserving the repetition of the word ‘I’. On the other hand, to translate disfluent sentences such as “I mean he is right” or “you know he is right” into Chinese, we should discard the two filler phrases “I mean” and “you know”. This is because in English, filler phrases such as “I mean” and “you know” can either be fillers which do not carry any meaning, or literally mean “my thinking is that” and “you have the knowledge of” respectively. But in other languages such as Chinese, the corresponding translations of these filler phrases may not be used as fillers. Therefore, for practical translation, unless disfluency information is explicitly required, we usually translate only the cleaned-up utterance without any edit or filler words.

Overall, disfluency prediction is used to clean up disfluency in speech so as to improve the accuracy and reduce ambiguity in downstream NLP tasks such as information extraction and machine translation.

1.3 Contributions of this Thesis

This thesis consists mainly of three parts: punctuation prediction, disfluency prediction, and joint punctuation and disfluency prediction. Parts of this thesis have been published in the following papers: (Wang et al., 2012), (Wang et al., 2014a), and (Wang et al., 2014b).

1.3.1 Dynamic Conditional Random Fields for Joint Sentence Boundary and Punctuation Prediction

For English, the type of a sentence is closely related to its starting words. In particular, if a sentence starts with ‘what’, ‘why’, ‘when’, etc., then most likely it is a question. If a sentence starts with ‘how’, then it could be a question or an exclamatory sentence. However, existing approaches to punctuation prediction using linear-chain Conditional Random Fields (LCRF) or Maximum Entropy Models (MaxEnt) are not able to make use of this feature if the sentence is too long. This is because of the limitation of the lexical context window when extracting features. By making use of Dynamic Conditional Random Fields (DCRF), we can propagate this piece of information over a much longer distance using an additional layer of labels. Experimentally, it turns out that this approach not only improves the performance of punctuation prediction, but also predicts sentence boundaries more accurately. In Chapter 4, we describe this approach in detail. In addition, we also describe several pruning techniques which not only make the model smaller, but also improve the performance slightly. These techniques can be useful for practical purposes.
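As a concrete sketch of this two-layer labeling, the snippet below pairs each token with a sentence-type label (first layer) and a punctuation label (second layer); the sentence-type layer carries the “starts with how” cue all the way to the sentence-final position where the question mark is predicted. The label names are illustrative placeholders, not the exact label sets defined in Chapter 4.

```python
# Illustrative two-layer labeling for joint sentence boundary and
# punctuation prediction. Label names are placeholders, not the
# exact label sets used in Chapter 4.

tokens = ["how", "did", "you", "do", "that", "it", "was", "easy"]
layer1 = ["QUES"] * 5 + ["DECL"] * 3                  # sentence-type layer
layer2 = ["NONE", "NONE", "NONE", "NONE", "QMARK",    # punctuation layer
          "NONE", "NONE", "PERIOD"]

for tok, l1, l2 in zip(tokens, layer1, layer2):
    print(f"{tok:>5}  {l1:>4}  {l2}")
```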

1.3.2 A Beam-Search Decoder for Disfluency Detection

Most state-of-the-art approaches to disfluency detection consist of a single-step prediction. There are several drawbacks to this. Firstly, the existing approaches do not take into consideration the quality of the cleaned-up utterances. Sometimes, even for humans, we need to look at the cleaned-up sentence in order to determine whether a word is disfluent. Secondly, one pass of disfluency clean-up may not be enough to remove all the disfluent tokens, because the disfluency pattern can sometimes be very complex when the speaker speaks the same term wrongly several times and makes multiple corrections. In such cases, the presence of one disfluent token may interfere with the detection of the surrounding disfluent tokens. Moreover, the characteristics of disfluency are different for words of different parts-of-speech (POS) (e.g., noun, verb, adjective, etc.). However, the model does not make a very clear distinction on this, which causes the disfluency detection accuracy for words of certain POS tags to be much lower than others. To overcome these limitations, we propose a beam-search decoder for disfluency detection. The proposed decoder can perform multiple iterations of cleaning up, evaluate the quality of a cleaned-up utterance, and combine several POS-specific expert systems. Overall, it achieves about 2% absolute improvement over the previous work (F-score = 84.1%).

1.3.3 Combining Punctuation and Disfluency Prediction

In the literature, researchers have treated punctuation and disfluency prediction as two independent tasks, both as post-processing steps for speech recognition. By default, people apply them in cascade, i.e., one task followed by the other. A natural question that arises is which task we should perform first. Our analysis shows that no matter which task is performed first, the system will always fail to handle some cases well, because some features that predict punctuation interfere with features that predict disfluency. In such cases, joint prediction might be more advantageous because the model is able to learn both tasks at the same time, eliminating a fixed ordering of the two tasks. In this work, we have evaluated and compared many state-of-the-art joint prediction methods. We show that the two tasks influence each other; the prediction in one task can provide useful information to the other task, and thus joint prediction works better than isolated prediction. However, using joint prediction models leads to higher model complexity, which limits their application in practice.
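To make the cascade/joint contrast concrete, the following sketch compares the two label spaces involved: a cascade runs two separate prediction passes, one per label set, while a joint model predicts a single cross-product label per token. The label names are illustrative placeholders, and the multiplicative growth of the joint label set is exactly the model-complexity cost mentioned above.

```python
# Illustrative label spaces for cascade vs. joint prediction.
# Label names are placeholders, not the exact sets used in Chapter 6.

from itertools import product

punct_labels = ["NONE", "COMMA", "PERIOD", "QMARK"]
disfl_labels = ["O", "E", "F"]          # fluent, edit word, filler

# Cascade: two separate prediction passes, one per label set.
cascade_sizes = (len(punct_labels), len(disfl_labels))

# Joint: one pass over the cross-product label set.
joint_labels = [f"{p}+{d}" for p, d in product(punct_labels, disfl_labels)]

print(cascade_sizes)        # (4, 3)
print(len(joint_labels))    # 12 -- joint label sets grow multiplicatively
print(joint_labels[:3])     # ['NONE+O', 'NONE+E', 'NONE+F']
```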

1.4 Organization of the Thesis

The remainder of this thesis is organized as follows. The next chapter gives an overview of related work in sentence boundary prediction, punctuation prediction, and disfluency detection. Chapter 3 describes the major machine learning algorithms used in this thesis, namely the linear-chain conditional random field (CRF), the dynamic conditional random field (DCRF), and the max-margin Markov network (M3N). Chapter 4 focuses on joint sentence boundary and punctuation prediction. Chapter 5 describes the beam-search decoder for disfluency detection. Chapter 6 conducts an empirical study on various state-of-the-art joint learning methods for performing joint punctuation and disfluency prediction. Chapter 7 concludes this thesis, highlighting several possible future works on this topic.


Chapter 2

Related Work

Research is the process of aggregation and evolution of mankind’s collective knowledge and understanding. Analogous to the fact that each mathematical theorem is derived from other theorems or axioms, research can never be done in isolation from the body of existing knowledge. Instead, it has to lay its basis on related previous works and facilities. The famous scientist Isaac Newton once put it in his letter, “if I have seen further it is by standing on the shoulders of giants”. There is no doubt that he was emphasizing the importance of previous works and giving them credit for his achievement. Similarly, natural language processing makes use of many algorithms in machine learning. It also makes use of a lot of knowledge from linguistics. In fact, it is a “marriage” between the two fields of knowledge. Thus, people also call it “computational linguistics”.

In this chapter, we give an overview of related work on sentence boundary prediction, punctuation prediction, and disfluency detection. The core algorithms are based on log-linear models and graphical models. They will be described in more detail in the next chapter.


2.1 Sentence Boundary and Punctuation Prediction

Much research on sentence boundary and punctuation prediction has been carried out in the speech and language processing field. For sentence boundary prediction, the most straightforward method is to compute the duration of the silence pause (technically called “pause duration”) from the speech audio. This method does not need any classifier, but it requires both the speech audio and the corresponding text to run forced alignment (Jelinek, 1997). Standard ASR systems split speech into segments according to the pause duration (Young et al., 1997). A segment is split if the pause duration is longer than some pre-determined threshold or if the length of the current speech segment exceeds the maximum limit that the speech recognizer can handle. In addition to the pause duration feature, (Wang and Narayanan, 2004) developed a multi-pass algorithm that uses pitch breaks and pitch durations. However, they did not use any textual information, which has been shown to be very important for detecting sentence boundaries. Studies on sentence boundary detection in speech have also been conducted for other languages such as Chinese (Zong and Ren, 2003) and Czech (Kolar et al., 2004). (Liu et al., 2006) made a comparison among Hidden Markov Models (HMM), Maximum Entropy Models (MaxEnt), and Conditional Random Fields (CRF) (Lafferty et al., 2001) for sentence boundary detection.

For punctuation prediction, the earliest research exploited lexical features only. (Beeferman et al., 1998) used trigram language modeling for comma prediction by treating commas as words. (Stolcke et al., 1998) proposed a hidden event language model that treated sentence boundary detection and punctuation insertion as interword hidden event detection tasks. Their proposed method was implemented in the open-source utility hidden-ngram as part of the SRILM toolkit (Stolcke, 2002). (Gravano et al., 2009) presented a purely n-gram based approach using finite state automata (FSA) that jointly predicted punctuation and case information for English. (Liu et al., 2006) also made a comparison among HMM, MaxEnt, and CRF on punctuation prediction. The work by (Lu and Ng, 2010) made use of dynamic conditional random fields (DCRF) by jointly predicting sentence type and punctuation, but without using prosodic or language model information. Moreover, in their experiments, the sentences had already been split. As shown in one of our experiments, it is much easier to predict punctuation on individual sentences than on sentences joined together.
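As a toy illustration of the “punctuation as words” idea behind these n-gram approaches, the sketch below scores candidate punctuated variants of a word sequence under a hand-specified bigram model and keeps the best one. The probabilities are invented, and real systems such as hidden-ngram use properly smoothed models with lattice decoding rather than explicit candidate enumeration.

```python
# Toy illustration of punctuation prediction with an n-gram LM that
# treats punctuation marks as ordinary tokens. Probabilities are made up.
import math

bigram_logprob = {
    ("i", "know"): math.log(0.30),  ("know", "why"): math.log(0.05),
    ("know", "."): math.log(0.30),  ("know", ","): math.log(0.20),
    (".", "why"): math.log(0.15),   (",", "why"): math.log(0.10),
    ("why", "."): math.log(0.10),   ("why", "?"): math.log(0.40),
}

def score(tokens, unseen=math.log(1e-4)):
    # Sum of bigram log-probabilities; unseen pairs get a floor score.
    return sum(bigram_logprob.get(p, unseen) for p in zip(tokens, tokens[1:]))

candidates = [
    ["i", "know", "why", "."],       # "I know why."
    ["i", "know", ".", "why", "?"],  # "I know. Why?"
]
print(" ".join(max(candidates, key=score)))  # i know . why ?
```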

Prosodic information has been shown to be helpful for punctuation prediction, and there are several works that make use of both prosodic and lexical features. (Kim and Woodland, 2001) combined prosodic and lexical information for punctuation prediction. In their work, prosodic features were incorporated using the classification and regression tree (CART), and lexical information was extracted in the form of language model scores. (Christensen et al., 2001) investigated both finite state and multi-layer perceptron methods for punctuating broadcast news, making use of both prosodic and linguistic information. (Huang and Zweig, 2002) used a maximum entropy model (MaxEnt) for punctuation insertion in English conversational speech. (Liu et al., 2006) also made use of both prosodic and lexical features in their work.

From the above, we can see that earlier work tended to use less memory-intensive features such as prosodic features and language model scores. This was due to the limitation of computational power: lexical features such as word n-gram features and POS n-gram features (described in Chapter 4) are extremely memory consuming. They need several gigabytes of memory for training, and before the year 2000, very few computers had that much memory. With the advent of faster processors and much larger memory, using more features has become feasible.

2.2 Disfluency Prediction

Researchers have tried many ways to detect disfluency. (Johnson and Charniak, 2004) proposed a TAG-based (Tree-Adjoining Grammar) noisy channel model, which showed great improvement over a boosting-based classifier (Charniak and Johnson, 2001). (Maskey et al., 2006) proposed a phrase-level machine translation approach for this task. (Liu et al., 2006) used conditional random fields (CRF) (Lafferty et al., 2001) for sentence boundary and edit word detection. They showed that CRF significantly outperformed maximum entropy models and hidden Markov models (HMM). (Zwarts and Johnson, 2011) extended this model using minimal expected F-loss oriented n-best reranking. (Georgila, 2009) presented a post-processing method during testing based on integer linear programming (ILP) to incorporate local and global constraints.

In addition to textual information, prosodic features extracted from speech have been shown to be helpful in detecting edit words in some previous works (Kahn et al., 2005; Liu et al., 2006; Zhang et al., 2006). (Savova and Bachenko, 2003) did an explicit study on the prosodic features for four types of disfluencies. (Zwarts and Johnson, 2011) also trained extra language models on additional corpora, and compared the effects of adding scores from different language models as features during reranking. They reported that using large language models trained on external data sources, they could gain approximately 3% in F1-score for edit word detection on the Switchboard development dataset. (Qian and Liu, 2013) made use of weighted Max-margin Markov Networks (M3N) to better balance precision and recall. They also proposed multi-step disfluency detection and achieved the highest F-score of 84.1% without using any external source of information. In this thesis, we incorporate their M3N model into our beam-search decoder framework with some additional features to further improve the score. All the previous works only focused on how to learn to detect disfluencies accurately. However, they have not considered evaluating the fluency of the cleaned-up utterances. Also, only the work by (Qian and Liu, 2013) found that disfluency detection performance can be further improved by performing additional passes of detection, and their system performed only one additional step of disfluency clean-up. Our design of the beam-search decoder takes all these issues into consideration.

2.3 Joint Learning and Joint Label Prediction

Natural language processing utilizes machine learning algorithms. In fact, the method we use for punctuation and disfluency prediction is adopted from sparse-feature label-sequence prediction algorithms in machine learning. In machine learning, researchers have also developed algorithms to predict multiple layers of labels together. (Ng and Low, 2004) proposed cross-product label prediction for Chinese word segmentation and POS tagging using a maximum entropy model. The cross-product label space is created by composing labels at different layers. They also made a comparison between joint prediction and cascade prediction. (Shi and Wang, 2007) proposed dual-layer CRF based joint decoding on the same joint task. They combined two linear-chain CRFs at the decoding level by merging their respective scores. (Sutton et al., 2007) proposed the dynamic CRF, which models the joint label distribution directly. It can predict multiple layers of labels in a joint manner. The toolkit he developed, GRMM (Sutton, 2006), also supports more arbitrary graphical structures, which might be useful in some other tasks.

For NLP applications, there have also been some works that addressed both punctuation and disfluency prediction. (Liu et al., 2006) and (Baron et al., 2002) carried out sentence unit (SU) and disfluency prediction as separate tasks. The difference between SU prediction and punctuation prediction lies only in the non-end-of-sentence punctuation symbols such as commas. (Stolcke et al., 1998) mixed sentence boundary labels with disfluency labels so that they did not predict punctuation on disfluent tokens. (Kim, 2004) performed joint SU and Interruption Point (IP) prediction. In their work, edit and filler regions are derived from predicted IPs using a rule-based system as a separate step.

Although these works have addressed both punctuation and disfluency prediction, they have not treated the two tasks as one joint prediction task, and the interaction between the two tasks is not well studied. In general, when there are multiple prediction tasks on the same data, it will often be more advantageous to perform joint prediction, because the prediction output of one task is often correlated with the other tasks, e.g., if the punctuation after the current word is a period, question mark, or exclamation mark, then the current word must be followed by a sentence boundary. Empirically, modelling the joint distribution is more powerful as it can capture more complex distributions in the data.


2.4 Model Combination using Beam-Search Decoders

In the previous section, we mentioned the advantage of joint prediction. In fact, this advantage is quite theoretical and holds in principle. In practice, modeling the joint distribution often leads to very high model complexity and insufficient training data. Moreover, sometimes the individual models can be too complicated to be jointly combined. To overcome this, (Russell and Norvig, 2009) described the beam search algorithm in their book “Artificial Intelligence: A Modern Approach”. It is a heuristic search algorithm that iteratively searches through the hypothesis space in order to find a good solution. As a breadth-first search algorithm in essence, its main advantage is that it does not require any modification of existing models when combining them. Instead, it treats each existing model as a black box. This adds much flexibility to model combination because, in this framework, low-level modeling and high-level modeling are very well isolated. At the lower level, we have the individual complex models, each of which can have thousands or even millions of weights. Each low-level model produces one or more scores as suggestions or recommendations to the higher level. At the higher level, we combine the scores from these low-level models, and tune the model combination weights.
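A minimal skeleton of this idea is sketched below: each scorer is a black-box function from a hypothesis to a log-score, the decoder expands hypotheses breadth-first, and only the top-k survive each iteration. The expansion function, scorers, and combination weights here are illustrative placeholders, not the components of the decoder built in Chapter 5.

```python
# Minimal beam-search skeleton: scorers are black boxes that map a
# hypothesis to a log-score; only the top-k hypotheses survive each step.
# All components here are placeholders for illustration.

def beam_search(initial, expand, scorers, weights, beam_size=5, steps=3):
    def total_score(hyp):
        # High-level combination: weighted sum of black-box model scores.
        return sum(w * s(hyp) for w, s in zip(weights, scorers))

    beam = [initial]
    for _ in range(steps):
        candidates = [new for hyp in beam for new in expand(hyp)]
        if not candidates:
            break
        beam = sorted(candidates, key=total_score, reverse=True)[:beam_size]
    return max(beam, key=total_score)

# Toy usage: hypotheses are strings; expansion appends a letter, and two
# "models" prefer longer strings and more 'a's respectively.
expand = lambda h: [h + c for c in "ab"]
scorers = [len, lambda h: h.count("a")]
print(beam_search("", expand, scorers, weights=[1.0, 2.0]))  # "aaa"
```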

Beam-search decoders are widely used in applications where it is difficult or not so obvious to combine several complex models. For example, in statistical machine translation (SMT) systems such as Moses (Koehn et al., 2007), beam-search decoding is used to combine the translation model and the language model. The translation model measures how likely a source phrase can be translated into various target phrases. The language model measures the quality of the translated sentence by computing the likelihood of the resulting word sequence. It is not easy to combine these two models because there are exponentially many ways to split a source sentence into phrases. Each source phrase can be translated into many different target phrases. Moreover, each candidate target phrase can be inserted at different places, since re-ordering often occurs during translation. The resulting search space is too huge to handle if we use exhaustive search without any heuristic approximation. That is why we need the beam-search decoder to perform heuristic search. Apart from machine translation, it is also used in automatic speech recognition (Young et al., 1997), grammatical error correction (Dahlmeier and Ng, 2012), social media text normalization (Wang and Ng, 2013), etc. In Chapter 5, we make use of the beam search decoding algorithm to perform disfluency detection.


Chapter 3

Machine Learning Models

In this chapter, we describe in detail the main machine learning algorithms used in this thesis, namely the conditional random field (CRF) and the max-margin Markov network (M3N), as well as their graphical model extensions. We will also discuss the relationship between model complexity and the clique order of the features. These are important practical concerns, because they directly affect the feasibility and the performance of a model trained on a corpus. We will also mention briefly our modification of the existing M3N toolkit to obtain both performance enhancement and additional functional support.

3.1 Conditional Random Fields

Maximum entropy models (MaxEnt) and conditional random fields (CRF) belong to the family of log-linear models. They originate from logistic regression. Logistic regression makes use of the logistic function to model probabilities:

P(t) = exp(t) / (exp(t) + 1) = 1 / (1 + exp(−t))     (3.1)

where t is the free variable, P(t) denotes the probability, and exp is the exponential function. The reason to use the logistic function instead of a linear function is that probability ranges from 0 to 1, while the value of a linear function ranges from −∞ to ∞. Take note that in Equation 3.1, as t ∈ (−∞, ∞), P(t) ∈ (0, 1). Mathematically, the logistic function provides a conformal mapping (a smooth map which preserves angles, (Nehari, 1975)) between the probability space and the linear space.

For classification, we view t as a linear combination of explanatory variables (a weighted linear combination of observed feature variables):

P(y) = exp(w · x) / (exp(w · x) + 1)

where x is the binary feature vector indicating which features are active, and w is the feature weight vector; its nth element indicates the extent to which the nth feature contributes to the probability, and w · x is the dot product (scalar product) between the two vectors.
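As a quick numeric illustration of this formulation, the sketch below computes P(y) for a sparse set of active binary features; the feature names and weights are invented for illustration.

```python
# Numeric illustration of P(y) = exp(w.x) / (exp(w.x) + 1) with sparse
# binary features. Feature names and weights are invented.
import math

weights = {"word=why": 1.2, "first_word_is_wh": 0.8, "bias": -0.5}

def prob(active_features):
    t = sum(weights.get(f, 0.0) for f in active_features)  # t = w . x
    return math.exp(t) / (math.exp(t) + 1.0)               # Equation 3.1

print(round(prob(["word=why", "first_word_is_wh", "bias"]), 3))  # ~0.818
```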

The above classification scheme only works for two classes. Multinomial logistic regression extends the classification to multiple classes. In multinomial logistic regression (Class 1 to Class J), there is a comparison class for which the weight vector is fixed to the zero vector.
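In common notation (though not necessarily the exact formulation used in this thesis), taking Class J as the comparison class gives:

```latex
% Multinomial logistic regression with class J as the comparison
% (reference) class: fixing w_J = 0 makes the model identifiable.
P(y = j \mid \mathbf{x})
  = \frac{\exp(\mathbf{w}_j \cdot \mathbf{x})}
         {\sum_{k=1}^{J} \exp(\mathbf{w}_k \cdot \mathbf{x})},
  \qquad j = 1, \dots, J, \quad \mathbf{w}_J = \mathbf{0}.
```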
