
Acoustic keyword spotting in speech with applications to data mining


Speech and Audio Research Laboratory of the SAIVT program Centre for Built Environment and Engineering Research

ACOUSTIC KEYWORD SPOTTING

IN SPEECH WITH APPLICATIONS

TO DATA MINING

AT QUEENSLAND UNIVERSITY OF TECHNOLOGY

BRISBANE, QUEENSLAND

9 MARCH 2005


Keyword Spotting, Wordspotting, Data Mining, Audio Indexing, Keyword Verification, Confidence Scoring, Speech Recognition, Utterance Verification

Keyword Spotting is the task of detecting keywords of interest within continuous speech. The applications of this technology range from call centre dialogue systems to covert speech surveillance devices. Keyword spotting is particularly well suited to data mining tasks such as real-time keyword monitoring and unrestricted vocabulary audio document indexing. However, to date, many keyword spotting approaches have suffered from poor detection rates, high false alarm rates, or slow execution times, thus reducing their commercial viability.

This work investigates the application of keyword spotting to data mining tasks. The thesis makes a number of major contributions to the field of keyword spotting.

The first major contribution is the development of a novel keyword verification method named Cohort Word Verification. This method combines high level linguistic information with cohort-based verification techniques to obtain dramatic improvements in verification performance, in particular for the problematic short duration target word class.

The second major contribution is the development of a novel audio document indexing technique named Dynamic Match Lattice Spotting. This technique augments lattice-based audio indexing principles with dynamic sequence matching techniques to provide robustness to erroneous lattice realisations. The resulting algorithm obtains significant improvement in detection rate over lattice-based


The third major contribution is the study of multiple verifier fusion for the task of keyword verification. The reported experiments demonstrate that substantial improvements in verification performance can be obtained through the fusion of multiple keyword verifiers. The research focuses on combinations of speech background model based verifiers and cohort word verifiers.

The final major contribution is a comprehensive study of the effects of limited training data for keyword spotting. This study is performed with consideration as to how these effects impact the immediate development and deployment of speech technologies for non-English languages.


1.1 Overview
1.1.1 Aims and Objectives
1.1.2 Research Scope
1.2 Thesis Organisation
1.3 Major Contributions of this Research
1.4 List of Publications
2 A Review of Keyword Spotting
2.1 Introduction
2.2 The keyword spotting problem
2.3 Applications of keyword spotting
2.3.1 Keyword monitoring applications
2.3.2 Audio document indexing
2.3.3 Command controlled devices
2.3.4 Dialogue systems
2.4 The development of keyword spotting
2.4.1 Sliding window approaches
2.4.2 Non-keyword model approaches
2.4.3 Hidden Markov Model approaches
2.4.4 Further developments
2.5 Performance Measures
2.5.1 The reference and result sets
2.5.2 The hit operator
2.5.3 Miss rate
2.5.4 False alarm rate
2.5.5 False acceptance rate
2.5.6 Execution time
2.5.7 Figure of Merit
2.5.8 Equal Error Rate
2.5.9 Receiver Operating Characteristic Curves
2.5.10 Detection Error Trade-off Plots
2.6 Unconstrained vocabulary spotting
2.6.1 HMM-based approach
2.6.2 Neural Network Approaches
2.7 Approaches to non-keyword modeling
2.7.1 Speech background model
2.7.2 Phone models
2.7.3 Uniform distribution
2.7.4 Online garbage model
2.8 Constrained vocabulary spotting
2.8.1 Language model approaches
2.8.2 Event spotting
2.9 Keyword verification
2.9.1 A formal definition
2.9.2 Combining keyword spotting and verification
2.9.3 The problem of short duration keywords
2.9.4 Likelihood ratio based approaches
2.9.5 Alternate Information Sources
2.10 Audio Document Indexing
2.10.1 Limitations of the Speech-to-Text Transcription approach
2.10.2 Reverse dictionary lookup searches
2.10.3 Indexed reverse dictionary lookup searches
2.10.4 Lattice based searches

3 HMM-based spotting and verification
3.1 Introduction
3.2 The confusability circle framework
3.3 Analysis of non-keyword models
3.3.1 All-speech models
3.3.2 SBM methods
3.3.3 Phone-set methods
3.3.4 Target-word-excluding methods
3.4 Evaluation of keyword spotting techniques
3.4.1 Experiment setup
3.5 Tuning the phone set non-keyword model
3.6 Output score thresholding for SBM spotting
3.7 Performance across keyword length
3.7.1 Evaluation sets
3.7.2 Results
3.8 HMM-based keyword verification
3.8.1 Evaluation set
3.8.2 Evaluation procedure
3.8.3 Results
3.9 Discriminative background model KV
3.9.1 System architecture
3.9.2 Results
3.10 Summary and Conclusions

4 Cohort word keyword verification
4.1 Introduction
4.2 Foundational concepts
4.2.1 Cohort-based scoring
4.2.2 The use of language information
4.3 Overview of the cohort word technique
4.4 Cohort word set construction
4.4.1 The choice of dmin and dmax
4.4.2 Cohort word set downsampling
4.4.3 Distance function
4.5 Classification approach
4.5.1 2-class classification approach
4.5.2 Hybrid N-class approach
4.6 Summary of the cohort word algorithm
4.7 Comparison of classifier approaches
4.7.1 Evaluation set
4.7.2 Recogniser parameters
4.7.3 Cohort word selection
4.7.4 Evaluation procedure
4.7.5 Results
4.8 Performance across target keyword length
4.8.1 Evaluation set
4.8.2 Recogniser parameters
4.8.3 Results
4.8.4 Analysis of poor 8-phone performance
4.8.5 Conclusions
4.9 Effects of selection parameters
4.9.1 Cohort word set downsampling
4.9.2 Cohort word selection range
4.9.3 MED cost parameters
4.9.4 Conclusions
4.10 Fused cohort word systems
4.10.1 Training dataset
4.10.2 Neural network architecture
4.10.3 Experimental procedure
4.10.4 Baseline unfused results
4.10.5 Fused SBM-CW experiments
4.10.6 Fused CW-CW experiments
4.10.7 Comparison of fused and unfused systems
4.11 Conclusions and Summary

5 Dynamic Match Lattice Spotting
5.1 Introduction
5.2 Motivation
5.3 Dynamic Match Lattice Spotting method
5.3.1 Basic method
5.3.2 Optimised Dynamic Match Lattice Search
5.4 Evaluation of DMLS performance
5.4.1 Evaluation set
5.4.2 Recogniser parameters
5.4.3 Lattice building
5.4.4 Query-time processing
5.4.5 Baseline systems
5.4.6 Evaluation procedure
5.4.7 Results
5.5 Analysis of dynamic match rules
5.5.1 System configurations
5.5.2 Results
5.6 Analysis of DMLS algorithm parameters
5.6.1 Number of lattice generation tokens
5.6.2 Pruning beamwidth
5.6.3 Number of lattice traversal tokens
5.6.4 MED cost threshold
5.6.5 Tuned systems
5.6.6 Conclusions
5.7 Conversational telephone speech experiments
5.7.1 Evaluation set
5.7.2 Recogniser parameters
5.7.3 Results
5.8 Non-destructive optimisations
5.8.1 Prefix sequence optimisation
5.8.2 Early stopping optimisation
5.8.3 Combining optimisations
5.9 Optimised system timings
5.9.1 Experimental procedure
5.9.2 Results
5.10 Summary

6 Non-English Spotting
6.1 Introduction
6.2 The issue of limited resources
6.3 The role of keyword spotting
6.4 Experiment setup
6.4.1 Database design
6.4.2 Model architectures
6.4.3 Evaluation set design
6.4.4 Evaluation procedure
6.5 English and Spanish stage 1 evaluations
6.6 English and Spanish post keyword verification
6.7 Indonesian spotting and verification
6.8 Extrapolating Indonesian performance
6.9 Summary and Conclusions

7 Summary, Conclusions and Future Work
7.1 HMM-based Spotting and Verification
7.1.1 Conclusions
7.2 Cohort Word Verification
7.2.1 Conclusions
7.2.2 Future Work
7.3 Dynamic Match Lattice Spotting
7.3.1 Conclusions
7.3.2 Future Work
7.4 Non-English Spotting
7.4.1 Conclusions
7.5 Final Comments
Bibliography
A The Levenstein Distance
A.1 Introduction
A.2 Applications
A.3 Algorithm

List of Tables

3.1 Keyword spotting performance of baseline systems on Switchboard 1 data
3.2 Effect of target word insertion penalty on PM-KS performance
3.3 Equal error rates of unnormalised and duration normalised output score thresholding applied to SBM-KS
3.4 Details of phone-length dependent evaluation sets
3.5 SBM-KS performance on Switchboard 1 data for different phone-length target words
3.6 Statistics for keyword verification evaluation sets
3.7 Equal error rates for SBM-based keyword verification
3.8 Equal error rates for SBM and MLP-SBM keyword verification
4.1 Evaluated cohort word selection parameters
4.2 Performance of selected cohort word KV systems on TIMIT evaluation set. Cohort word systems are qualified with the appropriate cohort word selection parameters using a tag in the format {dmin, dmax, ψd, ψi}
4.3 Performance of SBM-KV and selected cohort word systems on the SWB1 evaluation sets. Cohort word selection parameters are specified with each system in the format {dmin, dmax, ψd, ψi}
in the 3 best performing cohort word KV methods for the SWB1 evaluation set
4.5 Performance of baseline SBM-KV and best cohort word systems on the SWB1 evaluation sets
4.6 Performance of the best fused SBM-cohort systems on the SWB1 evaluation sets
4.7 Performance of the best fused cohort-cohort systems on the SWB1 evaluation sets
4.8 Correlation analysis of fused EER and individual unfused EER
4.9 Summary of best performing systems
5.1 Phone substitution costs for DMLS
5.2 Baseline keyword spotting results evaluated on TIMIT
5.3 TIMIT performance when isolating various DP rules
5.4 Effect of adjusting number of lattice generation tokens
5.5 Effect of adjusting pruning beamwidth
5.6 Effect of adjusting number of traversal tokens
5.7 Effect of adjusting MED cost threshold Smax
5.8 Optimised DMLS configurations evaluated on TIMIT
5.9 Keyword spotting results on SWB1
5.10 Relative speeds of optimised DMLS systems
5.11 Performance of a fully optimised DMLS system on Switchboard data
5.12 Summary of key results
6.1 Summary of training data sets
6.2 Codes used to refer to model architectures
6.3 Summary of evaluation data sets
6.4 Stage 1 spotting rates for various model sets and database sizes
6.5 Equal error rates after keyword verification for various model sets and training database sizes
6.6 Stage 1 spotting and stage 2 post verification results for S1I experiments

List of Figures

2.1 An example of a Receiver Operating Characteristic curve
2.2 An example of a Detection Error Trade-off plot
2.3 Recognition grammar for HMM-based keyword spotting
2.4 Sample recognition grammar for small non-keyword vocabulary keyword spotting
2.5 System architecture for HMM keyword spotting using a Speech Background Model as the non-keyword model
2.6 System architecture for HMM keyword spotting using a composite non-keyword model constructed from phone models
2.7 Constructing a recognition network for constrained vocabulary keyword spotting
2.8 An optimised constrained vocabulary keyword spotting recognition network (language model probabilities omitted)
2.9 An event spotting network for detecting occurrences of times [16]
2.10 Likelihood ratio based keyword occurrence verification with multiple verifier fusion
2.11 Applying reverse dictionary searches to the detection of the word ACQUIRE in a phone stream
2.12 Example of indexed reverse dictionary searching for the detection of the word ACQUIRE
ACQUIRE within a phone lattice
3.1 Confusability circle for the target word STOCK
3.2 Example of the shared subevent confusable acoustic region for the keyword STOCK
3.3 Incorporating target word insertion penalty into HMM-based keyword spotting
3.4 DET plots for unnormalised and duration normalised output score thresholding applied to SBM-KS
3.5 DET plots for duration normalised output score thresholding applied to SBM-KS for keyword length dependent evaluation sets
3.6 DET plots for different target keyword lengths for SBM-KV on Switchboard 1 evaluation sets
3.7 System architecture for MLP background model based KV
3.8 DET plots for SBM and MLP-SBM systems for 4-phone words
3.9 DET plots for SBM and MLP-SBM systems for 6-phone words
3.10 DET plots for SBM and MLP-SBM systems for 8-phone words
4.1 Controlling the degree of CAR region modeling dmin and dmax tuning
4.2 An N-class classifier approach to cohort word verification for the keyword w and cohort word set R(w)
4.3 DET plot for best cohort word and SBM-KV systems on SWB1 4-phone length evaluation set
4.4 DET plot for best cohort word and SBM-KV systems on SWB1 6-phone length evaluation set
4.5 Equal error rate versus mean number of cohort words
4.6 Trends in equal error rate with changes in cohort word set downsampling size
4.7 Trends in equal error rate with changes in cohort word selection range for 4-phone length cohort word KV
4.8 Trends in equal error rate with changes in cohort word selection range for 6-phone length cohort word KV
4.9 Trends in equal error rate with changes in cohort word selection range for 8-phone length cohort word KV
4.10 Trends in equal error rate with changes in MED cost parameters
4.11 Correlation between unfused system performances and fused system performances
4.12 Boxplot of EERs for all evaluated architectures and phone-lengths
4.13 Boxplot of log(EERs) for all evaluated architectures and phone-lengths
5.1 Segment of phone lattice for an instance of the word STOCK
5.2 Effect of lattice traversal token parameter
5.3 Trends in miss rate and FA/kw rate performance for various types of tuning
5.4 Plot of miss rate versus FA/kw rate for HMM, CLS and DMLS systems evaluated on Switchboard
5.5 The relationship between cost matrices for subsequences
5.6 Demonstration of the MED prefix optimisation algorithm
6.1 Effect of training dataset size on speech recognition [24]
6.2 Trends in miss rate across training database size
6.3 Trends in FA/kw rate across training database size
6.4 DET plot for T16 experiments. 1=T16S3E, 2=T16S2E, 3=T16S1E, 4=T16S2S, 5=T16S1S
6.5 DET plot for M16 experiments. 1=M16S3E, 2=M16S2E, 3=M16S1E, 4=M16S2S, 5=M16S1S
4=M32S2S, 5=M32S1S
6.7 Trends in EER across training dataset size
6.8 DET plot for S2S experiments. 1=T16S2S, 2=M16S2S, 3=M32S2S
6.9 DET plot for S1I experiments. 1=T16S1I, 2=M16S1I, 3=M32S1I
6.10 Extrapolations of Indonesian keyword spotting performance using larger sized databases
A.1 Example of cost matrix calculated using Levenstein algorithm for transforming deranged to hanged. Cost of substitutions, deletions and insertions all fixed at 1, cost of match fixed at 0

List of Abbreviations

ADI Audio Document Indexing

CAR Confusable Acoustic Region

CLS Conventional Lattice-based Spotting

CMS Cepstral Mean Subtraction

CW Cohort Word

DAR Disparate Acoustic Region

DET Detection Error Trade-off

DMLS Dynamic Match Lattice Spotting

EER Equal Error Rate

FA False Alarm

GMM Gaussian Mixture Model

HMM Hidden Markov Model

IRDL Indexed Reverse Dictionary Lookup

PLP Perceptual Linear Prediction

RDL Reverse Dictionary Lookup

SBM Speech Background Model

SBM-KS Speech Background Model based Keyword Spotting

SBM-KV Speech Background Model based Keyword Verification

STT Speech-to-Text Transcription

SWB1 Switchboard-1

TAR Target Acoustic Region

WSJ1 Wall Street Journal 1


The work contained in this thesis has not been previously submitted for a degree or diploma at any other higher educational institution. To the best of my knowledge and belief, the thesis contains no material previously published or written by another person except where due reference is made.

Signed:

Date:


Foremost I would like to acknowledge my Lord and Saviour Jesus Christ. It is by His grace that I was given the opportunity and necessary abilities to partake in this research.

I would also like to thank my beautiful wife, Melenie, who has been a constant source of support and inspiration. Your words of encouragement have seen me through the more difficult and frustrating times of this work.

To my supervisor, Professor Sridha Sridharan, I would like to offer my heartfelt gratitude for your unrelenting support in bringing this research to completion. Your positive words and guidance have been a true blessing.

I would also like to offer a special thanks for the friendship of the members of the QUT Speech Research Labs. In particular, I would like to thank Terry Martin, Robbie Vogt, Michael Mason and Brendan Baker for their constructive criticism as well as their constant joviality.

Finally, I would like to thank my loving two families for believing in and supporting me during this long venture, and my wonderful dogs for always giving me a reason to smile.

Kit Thambiratnam
Queensland University of Technology
February 2005

Primarily, keyword spotting is well suited to data-mining tasks that process large amounts of speech. This is because keyword spotting requires significantly less processing power than transcription, and can therefore run at considerably faster speeds. Real-time stream monitoring is one such example where this is required. These applications monitor audio in real-time and flag occurrences of segments of interest, such as news stories related to a specific topic. Clearly, the majority of the stream does not require attention, and therefore a keyword spotting solution that simply detects occurrences of topical keywords will be more efficient than a fully-fledged large vocabulary transcription engine.

Keyword spotting is also an excellent technology for audio search applications, such as audio document indexing. In particular, recent developments in KS including lattice-based searching and reverse dictionary lookup methods have made possible the development of unrestricted vocabulary audio document database search engines that can search hours of data in seconds.

However, many keyword spotting technologies are encumbered by poor detection performance or slow search speeds. There is a trade-off between accuracy and speed that needs to be managed, and unfortunately to date, many practical keyword spotting applications are forced to sacrifice detection performance to realise the execution speeds required for commercial deployment. One has only to use speech-recognition-enabled telephony services such as telephone banking to conclude that these systems are far from perfect.

Nevertheless, keyword spotting is a powerful and relevant technology. Used appropriately, a keyword spotting solution brings with it reduced computational requirements, increased scalability and potentially higher accuracies than a large vocabulary transcription system.

1.1.1 Aims and Objectives

This work specifically examines the application of keyword spotting technologies to two data mining tasks: real-time keyword monitoring and large audio document database indexing. With the ever-increasing amounts of audio and multimedia being generated daily, the ability to extract information from audio streams at high speeds while maintaining good detection rates is paramount.

A desirable feature of data mining applications is the support for unrestricted vocabulary keyword queries. However, a significant portion of past keyword spotting research has dealt primarily with restricted vocabulary methods. Although these approaches offer advantages in terms of detection and false alarm performance, they limit the flexibility of queries. As such, this work concerns itself solely with the study of unrestricted vocabulary keyword spotting techniques.

Data throughput is another major consideration when dealing with large amounts of data. Although computing is constantly becoming cheaper, it is nevertheless beneficial to run at high speeds. This is particularly true for audio indexing applications, where literally hundreds of hours may need to be interactively searched by a user. Unfortunately many published KS works neglect to consider execution time during experimentation. This research will therefore give considerable attention to the issue of processing speed.

The primary objectives of this thesis are as follows:

1. To review and investigate current state-of-the-art keyword spotting techniques that are relevant to the tasks of real-time keyword monitoring and audio document indexing.

2. To assess and evaluate the performance of these techniques with regards to crucial performance metrics relevant to the target applications, and as such, identify potential issues that need to be addressed.

3. To investigate and develop novel techniques that can be used to improve the performance of keyword spotting techniques for data mining applications.

4. To investigate the application of keyword spotting technologies for non-English data mining.

1.1.2 Research Scope

Keyword spotting encompasses a plethora of speech recognition research topics that unfortunately cannot be fully addressed in a single work. As such, the scope of this research was limited to issues that were directly related to the application of keyword spotting to real-time keyword monitoring and audio document indexing. Additionally, the following restrictions and constraints were applied to this research:

1. Primarily this work concerns itself with the application of HMM-based speech recognition techniques to the keyword spotting task. Alternate statistical modeling approaches, such as neural network techniques, have been proposed and demonstrated to be suitable for keyword spotting. However, it is believed that the HMM-based approach provides a greater degree of flexibility particularly with regards to unrestricted vocabulary tasks, and as such is the modeling architecture of choice for this research.

2. Experiments reported within this work are limited to single keyword detection. Although most practical applications of keyword spotting use multi-word detection during a single pass, it is believed that research constrained to single keyword detection offers a number of advantages. Primarily, it allows ease of comparison between results in this thesis and other published works. Additionally, the variability in performance due to different mixtures of words within a multi-word keyword set can be avoided, thereby ensuring greater consistency between experiments. Finally, it is believed that trends in single keyword spotting across methods will easily translate to multi-word keyword spotting tasks, and as such, this constraint does not limit the value of this research.

1.2 Thesis Organisation

An overview of the organisation of this thesis is given below:

Chapter 2 - A Review of Keyword Spotting presents a thorough review of keyword spotting and associated technologies. A formal definition of the keyword spotting problem is given, as well as a discussion of its primary applications. This is followed by an overview of the key performance metrics that are relevant to evaluating and understanding keyword spotting methodology. A detailed review of KS literature is then presented covering the topics of unrestricted and restricted spotting techniques, non-keyword modeling architectures, keyword verification and confidence scoring methods, and audio indexing approaches.

Chapter 3 - HMM-based Spotting and Verification discusses and evaluates existing HMM-based keyword spotting and verification techniques. Such methods have a strong following within the keyword spotting community. However, to date, there has been little published work that compares the performances of the various approaches. What little has been published has primarily focused on measuring performance for simplistic domains such as read microphone speech. A number of HMM-based techniques are evaluated in this chapter and the strengths and weaknesses of these methods are discussed.

Chapter 4 - Cohort Word Verification proposes a novel keyword verification approach that combines high level linguistic information with cohort-based verification techniques to yield improved performance. A number of experiments are reported on to measure the performance of this method for the conversational telephone speech and read microphone speech domains. The results demonstrate that significant gains can be obtained particularly for the difficult task of short-word keyword verification. In addition, experiments are performed using a fused architecture that combines cohort word verification with traditional background model based verification. Further gains in performance are obtained using this approach.

Chapter 5 - Dynamic Match Lattice Spotting proposes a novel audio indexing technique that is presented and evaluated in this chapter. Although existing unrestricted audio indexing methods are capable of very fast search speeds, they are encumbered by very poor miss rate performance. It is argued here that this poor miss rate is a result of inherent phone recogniser errors that are not accommodated for by these techniques. As such, a new method of lattice-based searching is proposed that incorporates dynamic sequence matching methods to provide robustness against erroneous lattice realisations. The results demonstrate that dramatic gains in performance can be obtained while still maintaining extremely fast search speeds.

Chapter 6 - Non-English Spotting studies the application of keyword spotting technologies to non-English languages. In particular, it examines the effects of limited training data on keyword spotting performance. The lack of availability of non-English training data has greatly hindered the development of other speech technologies such as large vocabulary speech transcribers. However, keyword spotting is a significantly more constrained task, and therefore may be less affected by reduced amounts of training data. If so, this may allow the immediate development of speech technologies for non-English languages without the need for the costly task of creating large training databases.

Chapter 7 - Summary, Conclusions and Future Work presents the summary and conclusions of this work as well as a discussion of future research directions.

1.3 Major Contributions of this Research

This work has generated a number of novel contributions to the field of keyword spotting. These are:

1. The development of the novel Cohort Word Verification technique. This method combines high level linguistic knowledge with cohort-based verification techniques to yield significant improvements particularly for the problematic area of short-word keyword verification.

2. The use of multiple keyword verifier fusion, in particular applied to the combination of cohort word verification with existing HMM-based techniques. It is demonstrated that such fusion techniques allow the strengths of individual verifiers to be combined to yield considerable improvements in verification performance.

3. The development of the novel Dynamic Match Lattice Spotting approach. This technique augments existing lattice-based audio indexing techniques with dynamic sequence matching to improve robustness to erroneous lattice realisation. The resulting algorithm is capable of searching hours of speech using seconds of processor time while maintaining good miss and false alarm rates.

4. A detailed study of the effects of limited training data for keyword spotting, as well as how this impacts the immediate development and deployment of speech technologies for non-English languages.
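The dynamic sequence matching idea in contribution 3 can be illustrated with a minimal Python sketch. This is not the thesis's DMLS implementation: it only shows the general principle of accepting an observed phone sequence when its minimum edit distance (MED) to the target sequence falls within a cost threshold, so that small recogniser errors do not cause a miss. The function names, the phone labels, the flat list of candidate sequences (a real system would traverse a phone lattice) and the unit costs are all assumptions made for illustration.

```python
from typing import List, Sequence


def min_edit_distance(target: Sequence[str], observed: Sequence[str],
                      sub_cost: float = 1.0, ins_cost: float = 1.0,
                      del_cost: float = 1.0) -> float:
    """Minimum cost of turning `target` into `observed` (matches cost 0)."""
    rows, cols = len(target) + 1, len(observed) + 1
    cost = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):          # target symbols missing from observed
        cost[i][0] = i * del_cost
    for j in range(1, cols):          # extra symbols present in observed
        cost[0][j] = j * ins_cost
    for i in range(1, rows):
        for j in range(1, cols):
            step = 0.0 if target[i - 1] == observed[j - 1] else sub_cost
            cost[i][j] = min(cost[i - 1][j - 1] + step,   # match or substitution
                             cost[i - 1][j] + del_cost,   # deletion
                             cost[i][j - 1] + ins_cost)   # insertion
    return cost[-1][-1]


def dynamic_match(target: Sequence[str],
                  candidates: List[Sequence[str]],
                  max_cost: float) -> List[Sequence[str]]:
    """Keep candidates whose edit cost to the target is within max_cost."""
    return [c for c in candidates if min_edit_distance(target, c) <= max_cost]


# Hypothetical phone sequences; the first is the assumed target pronunciation.
stock = ["s", "t", "aa", "k"]
observed_paths = [["s", "t", "aa", "k"], ["s", "t", "ao", "k"],
                  ["s", "t", "aa"], ["b", "ih", "g"]]
exact_hits = dynamic_match(stock, observed_paths, max_cost=0.0)
tolerant_hits = dynamic_match(stock, observed_paths, max_cost=1.0)
```

With a threshold of zero only the exact sequence is accepted, which mirrors exact-match lattice searching; raising the threshold also admits sequences containing a single substitution or deletion, at the price of more potential false alarms.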

1.4 List of Publications

The research presented in this thesis has resulted in the publication of a number of fully referenced peer reviewed works.

1. K. Thambiratnam and S. Sridharan, "Isolated word verification using Cohort Word-level Verification", in Proceedings of the European Conference on Speech Communication and Technology (EUROSPEECH), (Geneva, Switzerland), 2003.

2. K. Thambiratnam and S. Sridharan, "A study on the effects of limited training data for English, Spanish and Indonesian keyword spotting", in Proceedings of the 10th Australian International Conference on Speech Science and Technology (SST), (Sydney, Australia), 2004.

3. T. Martin, K. Thambiratnam and S. Sridharan, "Target Structured Cross Language Model Refinement", in Proceedings of the 10th Australian International Conference on Speech Science and Technology (SST), (Sydney, Australia), 2004.

4. K. Thambiratnam and S. Sridharan, "Fusion of cohort-word and speech background model based confidence scores for improved keyword confidence scoring and verification", in Proceedings of the IEEE 3rd International Conference on Sciences of Electronic, Technologies of Information and Telecommunications, (Susa, Tunisia), 2005.

5. K. Thambiratnam and S. Sridharan, "Dynamic match phone-lattice searches for very fast and accurate unrestricted vocabulary keyword spotting", in Proceedings of the 2005 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), (Philadelphia, USA), 2005.

Chapter 2

A Review of Keyword Spotting

This chapter presents a comprehensive review of keyword spotting technologies to date. Section 2.2 gives a formal definition of the keyword spotting problem and is followed by a discussion of the various applications of keyword spotting in section 2.3. A brief synopsis of the development of keyword spotting research is provided in section 2.4 as well as a detailed description of how keyword spotting performance is measured in section 2.5.

Subsequent sections discuss the current methods of keyword spotting with respect to their key applications. Section 2.6 discusses a number of algorithms for unconstrained vocabulary keyword spotting. This is followed by a description of the various approaches to non-keyword modeling in section 2.7. Approaches to constrained vocabulary keyword spotting are then presented in section 2.8 as well as methods for keyword occurrence verification in section 2.9. Finally, methods of applying KS to the task of audio document indexing are discussed in section 2.10.

2.2 The keyword spotting problem

Keyword spotting can be viewed as a special case of Speech-to-Text Transcription (STT), in which the transcription vocabulary is restricted to keywords of interest plus a non-keyword symbol that is used to represent all other words in the target application domain.

Let $O$ be an observation sequence, $V$ be the vocabulary of the target application domain, $Q$ be the set of keywords of interest and $\Omega$ be the non-keyword symbol. If STT is represented as the transformation $W = \mathrm{Transcribe}(O, V)$, where $W = \{w_1, w_2, \ldots\}$ is the resulting hypothesised word sequence, then the keyword spotting task can be defined as

f(Tail(W), Q) otherwise and Tail({xi}N

only interested in occurrences of a much smaller set of words defined by $Q$. Given this simplification, a more practical and efficient formulation of keyword spotting is

$$KS(O, V, Q) = \mathrm{Transcribe}(O, g(Q)) \tag{2.2}$$

where $g(Q) = Q \cup \{\Omega\}$.

This alternate approach requires transcription using a much smaller vocabulary of size $|Q| + 1$. Clearly, this is a considerably less computationally intensive task than transcription using the formulation in equation 2.1. However, it introduces the additional burden of an acoustic model representation of the non-keyword symbol $\Omega$. Definition of the non-keyword symbol is one of the active areas of keyword spotting research and is discussed further in section 2.7.
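The difference between the two formulations can be made concrete with a small Python sketch. It is an illustration under stated assumptions, not code from the thesis: the token `<nk>` standing in for the non-keyword symbol Ω and both function names are hypothetical, and the second function shows only one plausible reading of the transcribe-then-filter view, since the exact form of equation 2.1 is not fully legible here.

```python
from typing import Iterable, List, Set

NON_KEYWORD = "<nk>"  # hypothetical token standing in for the symbol Ω


def reduced_vocabulary(keywords: Set[str]) -> Set[str]:
    """g(Q) = Q ∪ {Ω}: the vocabulary used by the formulation in equation 2.2."""
    return set(keywords) | {NON_KEYWORD}


def spot_from_transcript(words: Iterable[str], keywords: Set[str]) -> List[str]:
    """Map a full-vocabulary transcript onto the symbols of g(Q).

    Every word outside the keyword set collapses to the non-keyword symbol.
    """
    return [w if w in keywords else NON_KEYWORD for w in words]


query = {"stock", "market"}
vocab = reduced_vocabulary(query)      # |Q| + 1 symbols
labels = spot_from_transcript(["the", "stock", "fell", "today"], query)
```

The sketch makes the trade-off visible: the filtering route needs a full-vocabulary transcript as input, whereas decoding directly over `reduced_vocabulary(query)` needs only |Q| + 1 symbols but requires an acoustic model for the non-keyword symbol.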

2.3 Applications of keyword spotting

Keyword spotting lends itself to a plethora of speech-enabled applications. Keyword spotting is particularly well suited to applications where large amounts of speech need to be processed. This is because it offers a significant speed benefit over a large vocabulary STT approach. Four major applications of this technology are keyword monitoring, audio document indexing, command controlled devices and dialogue systems.

2.3.1 Keyword monitoring applications

Keyword monitoring applications are required to continuously monitor a real-time stream of audio and to flag any occurrences of a keyword in the query set. Specific keyword monitoring applications include telephone tapping, listening device monitoring and broadcast monitoring.

Telephone tapping and listening device monitoring are used extensively by security organisations to detect criminal or malicious activity. Keyword spotting provides a fast and automatic solution to this task and potentially a higher detection accuracy than human monitoring, particularly when a very large number of audio streams needs to be monitored. However, these applications create a considerable challenge for keyword spotting because of the noisy nature of the speech being monitored. Telephone conversations may be plagued with significant background noise, multiple languages and even multiple speakers, providing challenges for acoustic modeling. Listening device audio may suffer from very low signal-to-noise ratios, a difficulty for any speech processing application.

Broadcast monitoring is actively performed by commercial broadcast monitoring companies to locate segments that may be of interest to a client. For example, a senator may be interested in all news stories in which he or she is mentioned; broadcast monitoring organisations provide such a service for a fee. A significant challenge of broadcast monitoring is the amount of audio that needs to be processed daily. Broadcast monitoring clients may be interested in stories from a comprehensive set of broadcast sources, including free-to-air television, cable television, commercial radio and community radio. It is easy to see that the vast number of these sources, combined with the fact that many of them broadcast continually 24 hours a day, 7 days a week, makes broadcast monitoring a very data intensive problem.

Keyword spotting provides an excellent solution to all these keyword monitoring tasks. Faster-than-real-time keyword spotting technologies are likely to process audio faster than a human processor. Additionally, the accuracy of an automatic system is also likely to exceed that of a human processor since computers do not suffer from the fatigue and mental distractions that plague a human processor. Keyword spotting is particularly well suited to the broadcast monitoring task since audio quality in this domain is usually of much higher quality
