Speech and Audio Research Laboratory of the SAIVT program
Centre for Built Environment and Engineering Research
ACOUSTIC KEYWORD SPOTTING
IN SPEECH WITH APPLICATIONS
AT QUEENSLAND UNIVERSITY OF TECHNOLOGY
BRISBANE, QUEENSLAND
9 MARCH 2005
Keyword Spotting, Wordspotting, Data Mining, Audio Indexing, Keyword Verification, Confidence Scoring, Speech Recognition, Utterance Verification
Keyword spotting is the task of detecting keywords of interest within continuous speech. The applications of this technology range from call centre dialogue systems to covert speech surveillance devices. Keyword spotting is particularly well suited to data mining tasks such as real-time keyword monitoring and unrestricted vocabulary audio document indexing. However, to date, many keyword spotting approaches have suffered from poor detection rates, high false alarm rates, or slow execution times, thus reducing their commercial viability.
This work investigates the application of keyword spotting to data mining tasks. The thesis makes a number of major contributions to the field of keyword spotting.
The first major contribution is the development of a novel keyword verification method named Cohort Word Verification. This method combines high level linguistic information with cohort-based verification techniques to obtain dramatic improvements in verification performance, in particular for the problematic short duration target word class.
The second major contribution is the development of a novel audio document indexing technique named Dynamic Match Lattice Spotting. This technique augments lattice-based audio indexing principles with dynamic sequence matching techniques to provide robustness to erroneous lattice realisations. The resulting algorithm obtains significant improvement in detection rate over lattice-based
The third major contribution is the study of multiple verifier fusion for the task of keyword verification. The reported experiments demonstrate that substantial improvements in verification performance can be obtained through the fusion of multiple keyword verifiers. The research focuses on combinations of speech background model based verifiers and cohort word verifiers.
The final major contribution is a comprehensive study of the effects of limited training data for keyword spotting. This study is performed with consideration as to how these effects impact the immediate development and deployment of speech technologies for non-English languages.
1.1 Overview
1.1.1 Aims and Objectives
1.1.2 Research Scope
1.2 Thesis Organisation
1.3 Major Contributions of this Research
1.4 List of Publications
2 A Review of Keyword Spotting
2.1 Introduction
2.2 The keyword spotting problem
2.3 Applications of keyword spotting
2.3.1 Keyword monitoring applications
2.3.2 Audio document indexing
2.3.3 Command controlled devices
2.3.4 Dialogue systems
2.4 The development of keyword spotting
2.4.1 Sliding window approaches
2.4.2 Non-keyword model approaches
2.4.3 Hidden Markov Model approaches
2.4.4 Further developments
2.5 Performance Measures
2.5.1 The reference and result sets
2.5.2 The hit operator
2.5.3 Miss rate
2.5.4 False alarm rate
2.5.5 False acceptance rate
2.5.6 Execution time
2.5.7 Figure of Merit
2.5.8 Equal Error Rate
2.5.9 Receiver Operating Characteristic Curves
2.5.10 Detection Error Trade-off Plots
2.6 Unconstrained vocabulary spotting
2.6.1 HMM-based approach
2.6.2 Neural Network Approaches
2.7 Approaches to non-keyword modeling
2.7.1 Speech background model
2.7.2 Phone models
2.7.3 Uniform distribution
2.7.4 Online garbage model
2.8 Constrained vocabulary spotting
2.8.1 Language model approaches
2.8.2 Event spotting
2.9 Keyword verification
2.9.1 A formal definition
2.9.2 Combining keyword spotting and verification
2.9.3 The problem of short duration keywords
2.9.4 Likelihood ratio based approaches
2.9.5 Alternate Information Sources
2.10 Audio Document Indexing
2.10.1 Limitations of the Speech-to-Text Transcription approach
2.10.2 Reverse dictionary lookup searches
2.10.3 Indexed reverse dictionary lookup searches
2.10.4 Lattice based searches
3 HMM-based spotting and verification
3.1 Introduction
3.2 The confusability circle framework
3.3 Analysis of non-keyword models
3.3.1 All-speech models
3.3.2 SBM methods
3.3.3 Phone-set methods
3.3.4 Target-word-excluding methods
3.4 Evaluation of keyword spotting techniques
3.4.1 Experiment setup
3.5 Tuning the phone set non-keyword model
3.6 Output score thresholding for SBM spotting
3.7 Performance across keyword length
3.7.1 Evaluation sets
3.7.2 Results
3.8 HMM-based keyword verification
3.8.1 Evaluation set
3.8.2 Evaluation procedure
3.8.3 Results
3.9 Discriminative background model KV
3.9.1 System architecture
3.9.2 Results
3.10 Summary and Conclusions
4 Cohort word keyword verification
4.1 Introduction
4.2 Foundational concepts
4.2.1 Cohort-based scoring
4.2.2 The use of language information
4.3 Overview of the cohort word technique
4.4 Cohort word set construction
4.4.1 The choice of dmin and dmax
4.4.2 Cohort word set downsampling
4.4.3 Distance function
4.5 Classification approach
4.5.1 2-class classification approach
4.5.2 Hybrid N-class approach
4.6 Summary of the cohort word algorithm
4.7 Comparison of classifier approaches
4.7.1 Evaluation set
4.7.2 Recogniser parameters
4.7.3 Cohort word selection
4.7.4 Evaluation procedure
4.7.5 Results
4.8 Performance across target keyword length
4.8.1 Evaluation set
4.8.2 Recogniser parameters
4.8.3 Results
4.8.4 Analysis of poor 8-phone performance
4.8.5 Conclusions
4.9 Effects of selection parameters
4.9.1 Cohort word set downsampling
4.9.2 Cohort word selection range
4.9.3 MED cost parameters
4.9.4 Conclusions
4.10 Fused cohort word systems
4.10.1 Training dataset
4.10.2 Neural network architecture
4.10.3 Experimental procedure
4.10.4 Baseline unfused results
4.10.5 Fused SBM-CW experiments
4.10.6 Fused CW-CW experiments
4.10.7 Comparison of fused and unfused systems
4.11 Conclusions and Summary
5 Dynamic Match Lattice Spotting
5.1 Introduction
5.2 Motivation
5.3 Dynamic Match Lattice Spotting method
5.3.1 Basic method
5.3.2 Optimised Dynamic Match Lattice Search
5.4 Evaluation of DMLS performance
5.4.1 Evaluation set
5.4.2 Recogniser parameters
5.4.3 Lattice building
5.4.4 Query-time processing
5.4.5 Baseline systems
5.4.6 Evaluation procedure
5.4.7 Results
5.5 Analysis of dynamic match rules
5.5.1 System configurations
5.5.2 Results
5.6 Analysis of DMLS algorithm parameters
5.6.1 Number of lattice generation tokens
5.6.2 Pruning beamwidth
5.6.3 Number of lattice traversal tokens
5.6.4 MED cost threshold
5.6.5 Tuned systems
5.6.6 Conclusions
5.7 Conversational telephone speech experiments
5.7.1 Evaluation set
5.7.2 Recogniser parameters
5.7.3 Results
5.8 Non-destructive optimisations
5.8.1 Prefix sequence optimisation
5.8.2 Early stopping optimisation
5.8.3 Combining optimisations
5.9 Optimised system timings
5.9.1 Experimental procedure
5.9.2 Results
5.10 Summary
6 Non-English Spotting
6.1 Introduction
6.2 The issue of limited resources
6.3 The role of keyword spotting
6.4 Experiment setup
6.4.1 Database design
6.4.2 Model architectures
6.4.3 Evaluation set design
6.4.4 Evaluation procedure
6.5 English and Spanish stage 1 evaluations
6.6 English and Spanish post keyword verification
6.7 Indonesian spotting and verification
6.8 Extrapolating Indonesian performance
6.9 Summary and Conclusions
7 Summary, Conclusions and Future Work
7.1 HMM-based Spotting and Verification
7.1.1 Conclusions
7.2 Cohort Word Verification
7.2.1 Conclusions
7.2.2 Future Work
7.3 Dynamic Match Lattice Spotting
7.3.1 Conclusions
7.3.2 Future Work
7.4 Non-English Spotting
7.4.1 Conclusions
7.5 Final Comments
Bibliography
A The Levenshtein Distance
A.1 Introduction
A.2 Applications
A.3 Algorithm
List of Tables
3.1 Keyword spotting performance of baseline systems on Switchboard 1 data
3.2 Effect of target word insertion penalty on PM-KS performance
3.3 Equal error rates of unnormalised and duration normalised output score thresholding applied to SBM-KS
3.4 Details of phone-length dependent evaluation sets
3.5 SBM-KS performance on Switchboard 1 data for different phone-length target words
3.6 Statistics for keyword verification evaluation sets
3.7 Equal error rates for SBM-based keyword verification
3.8 Equal error rates for SBM and MLP-SBM keyword verification
4.1 Evaluated cohort word selection parameters
4.2 Performance of selected cohort word KV systems on TIMIT evaluation set. Cohort word systems are qualified with the appropriate cohort word selection parameters using a tag in the format {dmin, dmax, ψd, ψi}
4.3 Performance of SBM-KV and selected cohort word systems on the SWB1 evaluation sets. Cohort word selection parameters are specified with each system in the format {dmin, dmax, ψd, ψi}
4.4 ... in the 3 best performing cohort word KV methods for the SWB1 evaluation set
4.5 Performance of baseline SBM-KV and best cohort word systems on the SWB1 evaluation sets
4.6 Performance of the best fused SBM-cohort systems on the SWB1 evaluation sets
4.7 Performance of the best fused cohort-cohort systems on the SWB1 evaluation sets
4.8 Correlation analysis of fused EER and individual unfused EER
4.9 Summary of best performing systems
5.1 Phone substitution costs for DMLS
5.2 Baseline keyword spotting results evaluated on TIMIT
5.3 TIMIT performance when isolating various DP rules
5.4 Effect of adjusting number of lattice generation tokens
5.5 Effect of adjusting pruning beamwidth
5.6 Effect of adjusting number of traversal tokens
5.7 Effect of adjusting MED cost threshold Smax
5.8 Optimised DMLS configurations evaluated on TIMIT
5.9 Keyword spotting results on SWB1
5.10 Relative speeds of optimised DMLS systems
5.11 Performance of a fully optimised DMLS system on Switchboard data
5.12 Summary of key results
6.1 Summary of training data sets
6.2 Codes used to refer to model architectures
6.3 Summary of evaluation data sets
6.4 Stage 1 spotting rates for various model sets and database sizes
6.5 Equal error rates after keyword verification for various model sets and training database sizes
6.6 Stage 1 spotting and stage 2 post verification results for S1I experiments
List of Figures
2.1 An example of a Receiver Operating Characteristic curve
2.2 An example of a Detection Error Trade-off plot
2.3 Recognition grammar for HMM-based keyword spotting
2.4 Sample recognition grammar for small non-keyword vocabulary keyword spotting
2.5 System architecture for HMM keyword spotting using a Speech Background Model as the non-keyword model
2.6 System architecture for HMM keyword spotting using a composite non-keyword model constructed from phone models
2.7 Constructing a recognition network for constrained vocabulary keyword spotting
2.8 An optimised constrained vocabulary keyword spotting recognition network (language model probabilities omitted)
2.9 An event spotting network for detecting occurrences of times [16]
2.10 Likelihood ratio based keyword occurrence verification with multiple verifier fusion
2.11 Applying reverse dictionary searches to the detection of the word ACQUIRE in a phone stream
2.12 Example of indexed reverse dictionary searching for the detection of the word ACQUIRE
2.13 ... ACQUIRE within a phone lattice
3.1 Confusability circle for the target word STOCK
3.2 Example of the shared subevent confusable acoustic region for the keyword STOCK
3.3 Incorporating target word insertion penalty into HMM-based keyword spotting
3.4 DET plots for unnormalised and duration normalised output score thresholding applied to SBM-KS
3.5 DET plots for duration normalised output score thresholding applied to SBM-KS for keyword length dependent evaluation sets
3.6 DET plots for different target keyword lengths for SBM-KV on Switchboard 1 evaluation sets
3.7 System architecture for MLP background model based KV
3.8 DET plots for SBM and MLP-SBM systems for 4-phone words
3.9 DET plots for SBM and MLP-SBM systems for 6-phone words
3.10 DET plots for SBM and MLP-SBM systems for 8-phone words
4.1 Controlling the degree of CAR region modeling dmin and dmax tuning
4.2 A N-class classifier approach to cohort word verification for the keyword w and cohort word set R(w)
4.3 DET plot for best cohort word and SBM-KV systems on SWB1 4-phone length evaluation set
4.4 DET plot for best cohort word and SBM-KV systems on SWB1 6-phone length evaluation set
4.5 Equal error rate versus mean number of cohort words
4.6 Trends in equal error rate with changes in cohort word set downsampling size
4.7 Trends in equal error rate with changes in cohort word selection range for 4-phone length cohort word KV
4.8 Trends in equal error rate with changes in cohort word selection range for 6-phone length cohort word KV
4.9 Trends in equal error rate with changes in cohort word selection range for 8-phone length cohort word KV
4.10 Trends in equal error rate with changes in MED cost parameters
4.11 Correlation between unfused system performances and fused system performances
4.12 Boxplot of EERs for all evaluated architectures and phone-lengths
4.13 Boxplot of log(EERs) for all evaluated architectures and phone-lengths
5.1 Segment of phone lattice for an instance of the word STOCK
5.2 Effect of lattice traversal token parameter
5.3 Trends in miss rate and FA/kw rate performance for various types of tuning
5.4 Plot of miss rate versus FA/kw rate for HMM, CLS and DMLS systems evaluated on Switchboard
5.5 The relationship between cost matrices for subsequences
5.6 Demonstration of the MED prefix optimisation algorithm
6.1 Effect of training dataset size on speech recognition [24]
6.2 Trends in miss rate across training database size
6.3 Trends in FA/kw rate across training database size
6.4 DET plot for T16 experiments 1=T16S3E, 2=T16S2E, 3=T16S1E, 4=T16S2S, 5=T16S1S
6.5 DET plot for M16 experiments 1=M16S3E, 2=M16S2E, 3=M16S1E, 4=M16S2S, 5=M16S1S
6.6 ... 4=M32S2S, 5=M32S1S
6.7 Trends in EER across training dataset size
6.8 DET plot for S2S experiments 1=T16S2S, 2=M16S2S, 3=M32S2S
6.9 DET plot for S1I experiments 1=T16S1I, 2=M16S1I, 3=M32S1I
6.10 Extrapolations of Indonesian keyword spotting performance using larger sized databases
A.1 Example of cost matrix calculated using Levenshtein algorithm for transforming deranged to hanged. Cost of substitutions, deletions and insertions all fixed at 1, cost of match fixed at 0
List of Abbreviations
ADI Audio Document Indexing
CAR Confusable Acoustic Region
CLS Conventional Lattice-based Spotting
CMS Cepstral Mean Subtraction
CW Cohort Word
DAR Disparate Acoustic Region
DET Detection Error Trade-off
DMLS Dynamic Match Lattice Spotting
EER Equal Error Rate
FA False Alarm
GMM Gaussian Mixture Model
HMM Hidden Markov Model
IRDL Indexed Reverse Dictionary Lookup
PLP Perceptual Linear Prediction
RDL Reverse Dictionary Lookup
SBM Speech Background Model
SBM-KS Speech Background Model based Keyword Spotting
SBM-KV Speech Background Model based Keyword Verification
STT Speech-to-Text Transcription
SWB1 Switchboard-1
TAR Target Acoustic Region
WSJ1 Wall Street Journal 1
The work contained in this thesis has not been previously submitted for a degree or diploma at any other higher educational institution. To the best of my knowledge and belief, the thesis contains no material previously published or written by another person except where due reference is made.
Signed:
Date:
Foremost I would like to acknowledge my Lord and Saviour Jesus Christ. It is by His grace that I was given the opportunity and necessary abilities to partake in this research.
I would also like to thank my beautiful wife, Melenie, who has been a constant source of support and inspiration. Your words of encouragement have seen me through the more difficult and frustrating times of this work.
To my supervisor, Professor Sridha Sridharan, I would like to offer my heartfelt gratitude for your unrelenting support in bringing this research to completion. Your positive words and guidance have been a true blessing.
I would also like to offer a special thanks for the friendship of the members of the QUT Speech Research Labs. In particular, I would like to thank Terry Martin, Robbie Vogt, Michael Mason and Brendan Baker for their constructive criticism as well as their constant joviality.
Finally, I would like to thank my loving two families for believing in and supporting me during this long venture, and my wonderful dogs for always giving me a reason to smile.
Kit Thambiratnam
Queensland University of Technology
February 2005
Primarily, keyword spotting is well suited to data-mining tasks that process large amounts of speech. This is because keyword spotting requires significantly less processing power than transcription, and can therefore run at considerably faster speeds. Real-time stream monitoring is one such example where this is required. These applications monitor audio in real-time and flag occurrences of segments of interest, such as news stories related to a specific topic. Clearly, the majority of the stream does not require attention, and therefore a keyword spotting solution that simply detects occurrences of topical keywords will be more efficient than a fully-fledged large vocabulary transcription engine.
Keyword spotting is also an excellent technology for audio search applications, such as audio document indexing. In particular, recent developments in KS including lattice-based searching and reverse dictionary lookup methods have made possible the development of unrestricted vocabulary audio document database search engines that can search hours of data in seconds.
However, many keyword spotting technologies are encumbered by poor detection performance or slow search speeds. There is a trade-off between accuracy and speed that needs to be managed, and unfortunately to date, many practical keyword spotting applications are forced to sacrifice detection performance to realise the execution speeds required for commercial deployment. One has only to use speech-recognition-enabled telephony services such as telephone banking to conclude that these systems are far from perfect.
Nevertheless, keyword spotting is a powerful and relevant technology. Used appropriately, a keyword spotting solution brings with it reduced computational requirements, increased scalability and potentially higher accuracies than a large vocabulary transcription system.
1.1.1 Aims and Objectives
This work specifically examines the application of keyword spotting technologies to two data mining tasks: real-time keyword monitoring and large audio document database indexing. With the ever-increasing amounts of audio and multimedia being generated daily, the ability to extract information from audio streams at high speeds while maintaining good detection rates is paramount.
A desirable feature of data mining applications is the support for unrestricted vocabulary keyword queries. However, a significant portion of past keyword spotting research has dealt primarily with restricted vocabulary methods. Although these approaches offer advantages in terms of detection and false alarm performance, they limit the flexibility of queries. As such, this work concerns itself solely with the study of unrestricted vocabulary keyword spotting techniques.
Data throughput is also another major consideration when dealing with large amounts of data. Although the cost of computing is constantly becoming cheaper, it is nevertheless beneficial to run at high speeds. This is particularly true for audio indexing applications, where literally hundreds of hours may need to be interactively searched by a user. Unfortunately many published KS works neglect to consider execution time during experimentation. This research will therefore give considerable attention to the issue of processing speed.
The primary objectives of this thesis are as follows:
1. To review and investigate current state-of-the-art keyword spotting techniques that are relevant to the tasks of real-time keyword monitoring and audio document indexing.
2. To assess and evaluate the performance of these techniques with regards to crucial performance metrics relevant to the target applications, and as such, identify potential issues that need to be addressed.
3. To investigate and develop novel techniques that can be used to improve the performance of keyword spotting techniques for data mining applications.
4. To investigate the application of keyword spotting technologies for non-English data mining.
1.1.2 Research Scope
Keyword spotting encompasses a plethora of speech recognition research topics that unfortunately cannot be fully addressed in a single work. As such, the scope of this research was limited to issues that were directly related to the application of keyword spotting to real-time keyword monitoring and audio document indexing. Additionally, the following restrictions and constraints were applied:
1. Primarily this work concerns itself with the application of HMM-based speech recognition techniques to the keyword spotting task. Alternate statistical modeling approaches, such as neural network techniques, have been proposed and demonstrated to be suitable for keyword spotting. However, it is believed that the HMM-based approach provides a greater degree of flexibility particularly with regards to unrestricted vocabulary tasks, and as such is the modeling architecture of choice for this research.
2. Experiments reported within this work are limited to single keyword detection. Although most practical applications of keyword spotting use multi-word detection during a single pass, it is believed that research constrained to single keyword detection offers a number of advantages. Primarily, it allows ease of comparison between results in this thesis and other published works. Additionally, the variability in performance due to different mixtures of words within a multi-word keyword set can be avoided, thereby ensuring greater consistency between experiments. Finally, it is believed that trends in single keyword spotting across methods will easily translate to multi-word keyword spotting tasks, and as such, this constraint does not limit the value of this research.
1.2 Thesis Organisation
An overview of the organisation of this thesis is given below:
Chapter 2 - A Review of Keyword Spotting presents a thorough review of keyword spotting and associated technologies. A formal definition of the keyword spotting problem is given, as well as a discussion of its primary applications. This is followed by an overview of the key performance metrics that are relevant to evaluating and understanding keyword spotting methodology. A detailed review of KS literature is then presented covering the topics of unrestricted and restricted spotting techniques, non-keyword modeling architectures, keyword verification and confidence scoring methods, and audio indexing approaches.
Chapter 3 - HMM-based Spotting and Verification discusses and evaluates existing HMM-based keyword spotting and verification techniques. Such methods have a strong following within the keyword spotting community. However, to date, there has been little published work that compares the performances of the various approaches. What little has been published has primarily focused on measuring performance for simplistic domains such as read microphone speech. A number of HMM-based techniques are evaluated in this chapter and the strengths and weaknesses of these methods are discussed.
Chapter 4 - Cohort Word Verification proposes a novel keyword verification approach that combines high level linguistic information with cohort-based verification techniques to yield improved performance. A number of experiments are reported to measure the performance of this method for the conversational telephone speech and read microphone speech domains. The results demonstrate that significant gains can be obtained particularly for the difficult task of short-word keyword verification. In addition, experiments are performed using a fused architecture that combines cohort word verification with traditional background model based verification. Further gains in performance are obtained using this approach.
Chapter 5 - Dynamic Match Lattice Spotting presents and evaluates a novel audio indexing technique. Although existing unrestricted audio indexing methods are capable of very fast search speeds, they are encumbered by very poor miss rate performance. It is argued here that this poor miss rate is a result of inherent phone recogniser errors that are not accommodated for by these techniques. As such, a new method of lattice-based searching is proposed that incorporates dynamic sequence matching methods to provide robustness against erroneous lattice realisations. The results demonstrate that dramatic gains in performance can be obtained while still maintaining extremely fast search speeds.
Chapter 6 - Non-English Spotting studies the application of keyword spotting technologies to non-English languages. In particular, it examines the effects of limited training data on keyword spotting performance. The lack of availability of non-English training data has greatly hindered the development of other speech technologies such as large vocabulary speech transcribers. However, keyword spotting is a significantly more constrained task, and therefore may be less affected by reduced amounts of training data. If so, this may allow the immediate development of speech technologies for non-English languages without the need for the costly task of creating large training databases.
Chapter 7 - Summary, Conclusions and Future Work presents the summary and conclusions of this work as well as a discussion of future research directions.
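The dynamic sequence matching mentioned in the Chapter 5 summary above can be illustrated with a small sketch. This is not the DMLS algorithm itself, which operates on phone lattices with its own cost rules; it is only a minimal minimum edit distance routine, assuming the unit costs quoted in the caption of Figure A.1 (substitution, deletion and insertion cost 1, match cost 0). The function names, the phone labels and the threshold value are hypothetical and chosen purely for illustration.

```python
from typing import Sequence


def edit_distance(source: Sequence[str], target: Sequence[str],
                  sub_cost: float = 1.0, ins_cost: float = 1.0,
                  del_cost: float = 1.0) -> float:
    """Minimum cost of transforming `source` into `target` (a match costs 0)."""
    # prev[j]: cost of turning the source prefix processed so far into target[:j].
    prev = [j * ins_cost for j in range(len(target) + 1)]
    for i, s in enumerate(source, start=1):
        curr = [i * del_cost]  # turning source[:i] into an empty target
        for j, t in enumerate(target, start=1):
            curr.append(min(
                prev[j] + del_cost,                            # delete s
                curr[j - 1] + ins_cost,                        # insert t
                prev[j - 1] + (0.0 if s == t else sub_cost),   # match or substitute
            ))
        prev = curr
    return prev[-1]


def is_putative_hit(observed: Sequence[str], target: Sequence[str],
                    max_cost: float) -> bool:
    """Accept an observed phone sequence if it is within `max_cost` of the target."""
    return edit_distance(observed, target) <= max_cost


if __name__ == "__main__":
    # Word pair taken from the Figure A.1 caption.
    print(edit_distance("deranged", "hanged"))
    # Hypothetical phone labels: one recogniser substitution is tolerated.
    print(is_putative_hit(["s", "t", "aa", "k"], ["s", "t", "ao", "k"], max_cost=1))
```

The threshold is what distinguishes this from exact matching: with `max_cost=0` only error-free phone sequences are accepted, whereas a small positive threshold tolerates a limited number of recogniser errors at the price of more false alarms.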
1.3 Major Contributions of this Research
This work has generated a number of novel contributions to the field of keyword spotting. These are:
1. The development of the novel Cohort Word Verification technique. This method combines high level linguistic knowledge with cohort-based verification techniques to yield significant improvements particularly for the problematic area of short-word keyword verification.
2. The use of multiple keyword verifier fusion, in particular applied to the combination of cohort word verification with existing HMM-based techniques. It is demonstrated that such fusion techniques allow the strengths of individual verifiers to be combined to yield considerable improvements in verification performance.
3. The development of the novel Dynamic Match Lattice Spotting approach. This technique augments existing lattice-based audio indexing techniques with dynamic sequence matching to improve robustness to erroneous lattice realisation. The resulting algorithm is capable of searching hours of speech using seconds of processor time while maintaining good miss and false alarm rates.
4. A detailed study of the effects of limited training data for keyword spotting, as well as how this impacts the immediate development and deployment of speech technologies for non-English languages.
1.4 List of Publications
The research presented in this thesis has resulted in the publication of a number of fully referenced peer reviewed works.
1. K. Thambiratnam and S. Sridharan, “Isolated word verification using Cohort Word-level Verification”, in Proceedings of the European Conference on Speech Communication and Technology (EUROSPEECH), (Geneva, Switzerland), 2003
2. K. Thambiratnam and S. Sridharan, “A study on the effects of limited training data for English, Spanish and Indonesian keyword spotting”, in Proceedings of the 10th Australian International Conference on Speech Science and Technology (SST), (Sydney, Australia), 2004
3. T. Martin, K. Thambiratnam and S. Sridharan, “Target Structured Cross Language Model Refinement”, in Proceedings of the 10th Australian International Conference on Speech Science and Technology (SST), (Sydney, Australia), 2004
4. K. Thambiratnam and S. Sridharan, “Fusion of cohort-word and speech background model based confidence scores for improved keyword confidence scoring and verification”, in Proceedings of the IEEE 3rd International Conference on Sciences of Electronic, Technologies of Information and Telecommunications, (Susa, Tunisia), 2005
5. K. Thambiratnam and S. Sridharan, “Dynamic match phone-lattice searches for very fast and accurate unrestricted vocabulary keyword spotting”, in Proceedings of the 2005 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), (Philadelphia, USA), 2005
Chapter 2
A Review of Keyword Spotting
2.1 Introduction
This chapter presents a comprehensive review of keyword spotting technologies to date. Section 2.2 gives a formal definition of the keyword spotting problem and is followed by a discussion of the various applications of keyword spotting in section 2.3. A brief synopsis of the development of keyword spotting research is provided in section 2.4 as well as a detailed description of how keyword spotting performance is measured in section 2.5.
Subsequent sections discuss the current methods of keyword spotting with respect to their key applications. Section 2.6 discusses a number of algorithms for unconstrained vocabulary keyword spotting. This is followed by a description of the various approaches to non-keyword modeling in section 2.7. Approaches to constrained vocabulary keyword spotting are then presented in section 2.8 as well as methods for keyword occurrence verification in section 2.9. Finally, methods of applying KS to the task of audio document indexing are discussed in section 2.10.
2.2 The keyword spotting problem
Keyword spotting can be viewed as a special case of Speech-to-Text Transcription (STT), in which the transcription vocabulary is restricted to keywords of interest plus a non-keyword symbol that is used to represent all other words in the target application domain.
Let O be an observation sequence, V be the vocabulary of the target application domain, Q be the set of keywords of interest and Ω be the non-keyword symbol. If STT is represented as the transformation W = Transcribe(O, V), where W = {w1, w2, ...} is the resulting hypothesised word sequence, then the keyword spotting task can be defined as
f(Tail(W), Q) otherwise and Tail({xi}N
only interested in occurrences of a much smaller set of words defined by Q. Given this simplification, a more practical and efficient formulation of keyword spotting is
KS(O, V, Q) = Transcribe(O, g(Q))    (2.2)
where g(Q) = Q ∪ {Ω}.
This alternate approach requires transcription using a much smaller vocabulary of size |Q| + 1. Clearly, this is a considerably less computationally intensive task than transcription using the formulation in equation 2.1. However, it introduces the additional burden of an acoustic model representation of the non-keyword symbol Ω. Definition of the non-keyword symbol is one of the active areas of keyword spotting research and is discussed further in section 2.7.
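As a rough illustration of the restricted vocabulary g(Q) = Q ∪ {Ω}, the following sketch builds that vocabulary and shows what a word sequence looks like when every word outside Q is represented by the non-keyword symbol. It works on text labels only and is not a recogniser: in a real spotter the decoder itself would emit Ω through an acoustic non-keyword model. The token string used for Ω, the function names and the example words are assumptions made for illustration.

```python
from typing import Iterable, List, Set

NON_KEYWORD = "<non-keyword>"  # stands in for the symbol Ω; the token name is an assumption


def restricted_vocabulary(keywords: Set[str]) -> Set[str]:
    """g(Q) = Q ∪ {Ω}: the keywords of interest plus a single non-keyword symbol."""
    return set(keywords) | {NON_KEYWORD}


def keyword_view(transcript: Iterable[str], keywords: Set[str]) -> List[str]:
    """Express a word sequence over g(Q), replacing every word outside Q with Ω."""
    return [w if w in keywords else NON_KEYWORD for w in transcript]


if __name__ == "__main__":
    Q = {"stock", "acquire"}
    W = ["we", "will", "acquire", "more", "stock", "today"]
    print(len(restricted_vocabulary(Q)))  # vocabulary size |Q| + 1
    print(keyword_view(W, Q))
```

The point of the sketch is the size of the search space: however large the application vocabulary V is, the restricted vocabulary grows only with the number of keywords.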
2.3 Applications of keyword spotting
Keyword spotting lends itself to a plethora of speech-enabled applications. Keyword spotting is particularly well suited to applications where large amounts of speech need to be processed. This is because it offers a significant speed benefit over a large vocabulary STT approach. Four major applications of this technology are keyword monitoring, audio document indexing, command controlled devices and dialogue systems.
2.3.1 Keyword monitoring applications
Keyword monitoring applications are required to continuously monitor a real-time stream of audio and to flag any occurrences of a keyword in the query set. Specific keyword monitoring applications include telephone tapping, listening device monitoring and broadcast monitoring.
Telephone tapping and listening device monitoring are used extensively by security organisations to detect criminal or malicious activity. Keyword spotting provides a fast and automatic solution to this task and potentially a higher detection accuracy than human monitoring, particularly when a very large number of audio streams needs to be monitored. However, these applications create a considerable challenge for keyword spotting because of the noisy nature of the speech being monitored. Telephone conversations may be plagued with significant background noise, multiple languages and even multiple speakers, providing challenges for acoustic modeling. Listening device audio may suffer from very low signal-to-noise ratios, a difficulty for any speech processing application.
Broadcast monitoring is actively performed by commercial broadcast monitoring companies to locate segments that may be of interest to a client. For example, a senator may be interested in all news stories in which he or she is mentioned; broadcast monitoring organisations provide such a service for a fee. A significant challenge of broadcast monitoring is the amount of audio that needs to be processed daily. Broadcast monitoring clients may be interested in stories from a comprehensive set of broadcast sources, including free-to-air television, cable television, commercial radio and community radio. It is easy to see that the vast number of these sources, combined with the fact that many of them broadcast continually 24 hours a day, 7 days a week, makes broadcast monitoring a very data intensive problem.
Keyword spotting provides an excellent solution to all these keyword monitoring tasks. Faster-than-real-time keyword spotting technologies are likely to process audio faster than a human processor. Additionally, the accuracy of an automatic system is also likely to exceed that of a human processor, since computers do not suffer from the fatigue and mental distractions that plague a human processor. Keyword spotting is particularly well suited to the broadcast monitoring task since audio quality in this domain is usually of much higher quality