


Lawrence R. Rabiner et al., “Speech Recognition by Machine.”

2000 CRC Press LLC <http://www.engnetbase.com>.


Speech Recognition by Machine

Lawrence R. Rabiner

AT&T Labs — Research

B. H. Juang

Bell Laboratories

Lucent Technologies

47.1 Introduction
47.2 Characterization of Speech Recognition Systems
47.3 Sources of Variability of Speech
47.4 Approaches to ASR by Machine
    The Acoustic-Phonetic Approach [1] • “Pattern-Matching” Approach [2] • Artificial Intelligence Approach [3, 4]
47.5 Speech Recognition by Pattern Matching
    Speech Analysis • Pattern Training • Pattern Matching • Decision Strategy • Results of Isolated Word Recognition
47.6 Connected Word Recognition
    Performance of Connected Word Recognizers
47.7 Continuous Speech Recognition
    Sub-Word Speech Units and Acoustic Modeling • Word Modeling From Sub-Word Units • Language Modeling Within the Recognizer • Performance of Continuous Speech Recognizers
47.8 Speech Recognition System Issues
    Robust Speech Recognition [18] • Speaker Adaptation [25] • Keyword Spotting [26] and Utterance Verification [27] • Barge-In
47.9 Practical Issues in Speech Recognition
47.10 ASR Applications
References

47.1 Introduction

Over the past several decades, a need has arisen to enable humans to communicate with machines in order to control their actions or to obtain information. Initial attempts at providing human-machine communication led to the development of the keyboard, the mouse, the trackball, the touch screen, and the joystick. However, none of these devices provides the richness or ease of use of speech, which has been the most natural form of communication between humans for tens of centuries. Hence, a need has arisen to provide a voice interface between humans and machines. This need has been met, to a limited extent, by speech processing systems that enable a machine to speak (speech synthesis systems) and that enable a machine to understand human speech (speech recognition systems). We concentrate on speech recognition systems in this section.

Speech recognition by machine refers to the capability of a machine to convert human speech to a textual form, providing a transcription or interpretation of everything the human speaks while the machine is listening. This capability is required for tasks in which the human controls the actions of the machine using only limited speaking capability, e.g., by speaking simple commands or sequences of words from a limited vocabulary (e.g., digit sequences for a telephone number). In the more general case, usually referred to as speech understanding, the machine need only recognize a limited subset of the user's input speech, namely the speech that specifies enough about the requested action that the machine can either respond appropriately or initiate some action in response to what was understood.

Speech recognition systems have been deployed in applications ranging from control of desktop computers, to telecommunication services, to business services, and have achieved varying degrees of success and commercialization.

In this section we discuss a range of issues involved in the design and implementation of speech recognition systems.

47.2 Characterization of Speech Recognition Systems

A number of issues define the technology of speech recognition systems. These include:

1. The manner in which a user speaks to the machine. There are generally three modes of speaking:

• isolated word (or phrase) mode, in which the user speaks individual words (or phrases) drawn from a specified vocabulary;

• connected word mode, in which the user speaks fluent speech consisting entirely of words from a specified vocabulary (e.g., telephone numbers);

• continuous speech mode, in which the user can speak fluently from a large (often unlimited) vocabulary.

2. The size of the recognition vocabulary:

• small vocabulary systems, which provide recognition capability for up to 100 words;

• medium vocabulary systems, which provide recognition capability for 100 to 1000 words;

• large vocabulary systems, which provide recognition capability for over 1000 words.

3. The knowledge of the user's speech patterns:

• speaker dependent systems, which have been custom tailored to each individual talker;

• speaker independent systems, which work on broad populations of talkers, most of whom the system has never encountered or adapted to;

• speaker adaptive systems, which customize their knowledge to each individual user over time while the system is in use.

4. The amount of acoustic and lexical knowledge used in the system:

• simple acoustic systems, which have no linguistic knowledge;

• systems which integrate acoustic and linguistic knowledge, where the linguistic knowledge is generally represented via syntactic and semantic constraints on the output of the recognition system.

5. The degree of dialogue between the human and the machine:

• one-way (passive) communication, in which each spoken user input is acted upon;

• system-driven dialog systems, in which the system is the sole initiator of the dialog, requesting information from the user via verbal input;

• natural dialogue systems, in which the machine conducts a conversation with the speaker, solicits inputs, acts in response to user inputs, and even tries to clarify ambiguity in the conversation.

47.3 Sources of Variability of Speech

Speech recognition by machine is inherently difficult because of the variability in the signal. Sources of this variability include:

1. Within-speaker variability in maintaining consistent pronunciation and use of words and phrases.

2. Across-speaker variability due to physiological differences (e.g., different vocal tract lengths), regional accents, foreign languages, etc.

3. Transducer variability while speaking over different microphones/telephone handsets.

4. Variability introduced by the transmission system (the media through which speech is transmitted, telecommunication networks, cellular phones, etc.).

5. Variability in the speaking environment, including extraneous conversations and acoustic background events (e.g., noise, door slams).

47.4 Approaches to ASR by Machine

47.4.1 The Acoustic-Phonetic Approach [1]

The earliest approaches to speech recognition were based on finding speech sounds and providing appropriate labels to these sounds. This is the basis of the acoustic-phonetic approach, which postulates that there exist finite, distinctive phonetic units (phonemes) in spoken language, and that these units are broadly characterized by a set of acoustic properties that are manifest in the speech signal over time. Even though the acoustic properties of phonetic units are highly variable, both with speakers and with neighboring sounds (the so-called coarticulation), it is assumed in the acoustic-phonetic approach that the rules governing the variability are straightforward and can be readily learned (by a machine). The first step in the acoustic-phonetic approach is a segmentation and labeling phase, in which the speech signal is segmented into stable acoustic regions and one or more phonetic labels are attached to each segmented region, resulting in a phoneme lattice characterization of the speech (see Fig. 47.1). The second step attempts to determine a valid word (or string of words) from the phonetic label sequences produced in the first step. In the validation process, linguistic constraints of the task (i.e., the vocabulary, the syntax, and other semantic rules) are invoked in order to access the lexicon for word decoding based on the phoneme lattice. The acoustic-phonetic approach has not been widely used in most commercial applications.

47.4.2 “Pattern-Matching” Approach [2]

The “pattern-matching” approach involves two essential steps, namely, pattern training and pattern comparison. The essential feature of this approach is that it uses a well-formulated mathematical framework and establishes consistent speech pattern representations, for reliable pattern comparison, from a set of labeled training samples via a formal training algorithm. A speech pattern representation can be in the form of a speech template or a statistical model, and can be applied to a sound (smaller than a word), a word, or a phrase. In the pattern-comparison stage of the approach, a direct comparison is made between the unknown speech (the speech to be recognized) and each possible pattern learned in the training stage, in order to determine the identity of the unknown according to the goodness of match of the patterns. The pattern-matching approach has become the predominant method of speech recognition in the last decade, and we shall elaborate on it in subsequent sections.

FIGURE 47.1: Segmentation and labeling for the word sequence “seven-six”.

47.4.3 Artificial Intelligence Approach [3, 4]

The “artificial intelligence” approach attempts to mechanize the recognition procedure according to the way a person applies intelligence in visualizing, analyzing, and characterizing speech based on a set of measured acoustic features. Among the techniques used within this class of methods is the use of an expert system (e.g., a neural network) which integrates phonemic, lexical, syntactic, semantic, and even pragmatic knowledge for segmentation and labeling, and which uses tools such as artificial neural networks for learning the relationships among phonetic events. The focus in this approach has been mostly on the representation of knowledge and the integration of knowledge sources. This method has not been used widely in commercial systems.

47.5 Speech Recognition by Pattern Matching

Figure 47.2 is a block diagram that depicts the pattern-matching framework. The speech signal is first analyzed, and a feature representation is obtained for comparison with either stored reference templates or statistical models in the pattern matching block. A decision scheme determines the word or phonetic class of the unknown speech based on the matching scores with respect to the stored reference patterns.

There are two types of reference patterns that can be used with the model of Fig. 47.2. The first type, called a nonparametric reference pattern [5] (or, often, a template), is a pattern created from one or more spoken tokens (exemplars) of the sound associated with the pattern. The second type, called a statistical reference model, is created as a statistical characterization (via a fixed type of model) of the behavior of a collection of tokens of the sound associated with the pattern. The hidden Markov model [6] is an example of such a statistical model.


FIGURE 47.2: Block diagram of pattern-recognition speech recognizer.

The model of Fig. 47.2 has been used (either explicitly or implicitly) for almost all commercial and industrial speech recognition systems, for the following reasons:

1. It is invariant to different speech vocabularies, user sets, feature sets, pattern matching algorithms, and decision rules.

2. It is easy to implement in software (and hardware).

3. It works well in practice.

We now discuss the elements of the pattern recognition model and show how it has been used in isolated word, connected word, and continuous speech recognition systems.

47.5.1 Speech Analysis

The purpose of the speech analysis block is to transform the speech waveform into a parsimonious representation which characterizes the time-varying properties of the speech. The transformation is normally done on successive, and possibly overlapped, short intervals 10 to 30 msec in duration (i.e., short-time analysis) due to the time-varying nature of speech. The representation [7] could be spectral parameters, such as the output from a filter bank, a discrete Fourier transform (DFT), or a linear predictive coding (LPC) analysis, or it could be temporal parameters, such as the locations of various zero- or level-crossing times in the speech signal.

Empirical knowledge gained over decades of psychoacoustic studies suggests that the power spectrum contains the acoustic information necessary for high-accuracy sound identification. Studies in psychoacoustics also suggest that our auditory perception of sound power and loudness involves compression, leading to the use of the logarithmic power spectrum and the cepstrum [8], which is the Fourier transform of the log-spectrum. The low-order cepstral coefficients (up to 10 to 20) provide a parsimonious representation of the short-time speech segment which is usually sufficient for phonetic identification. The cepstral parameters are often augmented by the so-called delta cepstrum [9], which characterizes dynamic aspects of the time-varying speech process.
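The analysis described above can be sketched in a few lines. The sketch below assumes 16-kHz audio, 25-ms frames with a 10-ms hop, and a plain real cepstrum with first-difference deltas; these are illustrative choices, not the chapter's specific front end.

```python
import numpy as np

def cepstral_features(signal, frame_len=400, hop=160, n_ceps=13):
    """Short-time cepstral analysis: one feature vector per 25-ms frame
    (assuming 16-kHz audio and a 10-ms hop between frames)."""
    window = np.hamming(frame_len)
    feats = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len] * window
        # log power spectrum of the windowed frame
        spectrum = np.abs(np.fft.rfft(frame)) ** 2
        log_spec = np.log(spectrum + 1e-10)       # small floor avoids log(0)
        # real cepstrum = inverse Fourier transform of the log spectrum;
        # keep only the low-order coefficients (c1..c13 here)
        cepstrum = np.fft.irfft(log_spec)
        feats.append(cepstrum[1:n_ceps + 1])
    feats = np.array(feats)
    # delta cepstrum: first difference across frames captures dynamics
    delta = np.vstack([np.zeros(n_ceps), np.diff(feats, axis=0)])
    return np.hstack([feats, delta])

# Example: features for 1 s of synthetic "speech"
x = np.random.default_rng(0).standard_normal(16000)
F = cepstral_features(x)
print(F.shape)  # (98, 26): 98 frames, 13 cepstral + 13 delta coefficients
```

In practice the static and delta coefficients are concatenated, as here, into a single feature vector per frame for the pattern matcher.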

47.5.2 Pattern Training

Pattern training is the method by which representative sound patterns (for the unit being trained) are converted into reference patterns for use by the pattern matching algorithm. There are several ways in which pattern training can be performed, including:

1. Casual training, in which a single sound pattern is used directly to create either a template or a crude statistical model (due to the paucity of data).

2. Robust training, in which several (typically 2 to 4) versions of the sound pattern (usually extracted from the speech of a single talker) are used to create a single merged template or statistical model.

3. Clustering training, in which a large number of versions of the sound pattern (extracted from a wide range of talkers) is used to create one or more templates or a reliable statistical model of the sound pattern.
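The clustering idea can be illustrated with a toy sketch. It makes the simplifying assumption that each training token is a single fixed-length feature vector and uses basic k-means, whereas real systems cluster variable-length patterns under a time-aligned (DTW) distance; the data values are invented.

```python
import numpy as np

def cluster_templates(tokens, n_templates=2, iters=10):
    """Clustering training (illustrative): given many feature tokens of the
    same word from a wide range of talkers, run a basic k-means and return
    each cluster centroid as one reference template."""
    X = np.stack(tokens).astype(float)
    # crude initialization: spread the initial templates across the tokens
    centers = X[np.linspace(0, len(X) - 1, n_templates).astype(int)].copy()
    for _ in range(iters):
        # assign every training token to its nearest current template
        labels = np.linalg.norm(X[:, None] - centers[None], axis=2).argmin(axis=1)
        # re-estimate each template as the mean of its assigned tokens
        for k in range(n_templates):
            if (labels == k).any():
                centers[k] = X[labels == k].mean(axis=0)
    return centers

# Two broad talker "populations" pronouncing the same word differently:
tokens = [np.array([0.0, 0.0]) + 0.1 * i for i in range(5)] + \
         [np.array([5.0, 5.0]) + 0.1 * i for i in range(5)]
print(cluster_templates(tokens))  # two templates, near [0.2, 0.2] and [5.2, 5.2]
```

The point of the exercise is that one template per word is rarely enough across talkers; a few cluster centroids (or mixture components, in the HMM case) cover the major pronunciation variants.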

In order to better understand how and why statistical models are so broadly used in speech recognition, we now formally define an important class of statistical models, namely the hidden Markov model (HMM) [6].

The HMM

The HMM is a statistical characterization of both the dynamics (the time-varying nature) and the statics (the spectral characterization of sounds) of speech during the speaking of a sub-word unit, a word, or even a phrase. The basic premise of the HMM is that a Markov chain can be used to describe the probabilistic nature of the temporal sequence of sounds in speech, i.e., the phonemes in the speech, via a probabilistic state sequence. The states in the sequence are not observed with certainty because the correspondence between linguistic sounds and the speech waveform is probabilistic in nature; hence the concept of a hidden model. Instead, the states manifest themselves through the second component of the HMM, which is a set of output distributions governing the production of the speech features in each state (the spectral characterization of the sounds). In other words, the output distributions (which are observed) represent the local statistical knowledge of the speech pattern within the state, and the Markov chain characterizes, through a set of state transition probabilities, how these sound processes evolve from one sound to another. Integrated together, these two components make the HMM particularly well suited for modeling speech processes.

FIGURE 47.3: Characterization of a word (or phrase, or subword) using an N-state (N = 5), left-to-right HMM, with continuous observation densities in each state of the model.

An example of an HMM of a speech pattern is shown in Fig. 47.3. The model has five states (corresponding to five distinct “sounds” or “phonemes” within the speech), and the state (corresponding to the sound being spoken) proceeds from left to right as time progresses. Within each state (assumed to represent a stable acoustical distribution), the spectral features of the speech signal are characterized by a mixture Gaussian density of spectral features (called the observation density), along with an energy distribution and a state duration probability. The states represent the changing temporal nature of the speech signal; hence, indirectly, they represent the speech sounds within the pattern.

The training problem for HMMs consists of estimating the parameters of the statistical distributions within each state (e.g., means, variances, mixture gains, etc.), along with the state transition probabilities for the composite HMM. Well-established techniques (e.g., the Baum-Welch method [10] or the segmental K-means method [11]) have been defined for doing this pattern training efficiently.

47.5.3 Pattern Matching

Pattern matching refers to the process of assessing the similarity between two speech patterns, one of which represents the unknown speech and one of which represents the reference pattern (derived from the training process) for each element that can be recognized. When the reference pattern is a “typical” utterance template, pattern matching produces a gross similarity (or dissimilarity) score. When the reference pattern is a probabilistic model, such as an HMM, pattern matching is equivalent to using the statistical knowledge contained in the model to assess the likelihood of the speech that led to the model being realized as the unknown pattern.
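For the HMM case, the likelihood assessment is typically carried out with the forward algorithm. The sketch below scores an observation sequence against a toy three-state, left-to-right HMM with one-dimensional Gaussian output densities; every parameter value is invented for illustration.

```python
import numpy as np

# Toy 3-state left-to-right HMM with 1-D Gaussian output densities.
A = np.array([[0.6, 0.4, 0.0],
              [0.0, 0.7, 0.3],
              [0.0, 0.0, 1.0]])          # state transition probabilities
pi = np.array([1.0, 0.0, 0.0])           # always start in the first state
means = np.array([0.0, 3.0, -1.0])       # per-state output density means
stds = np.array([1.0, 0.8, 1.2])

def gauss(x, mu, sd):
    return np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

def log_likelihood(obs):
    """Scaled forward algorithm: log P(obs | model), summing over all
    hidden state sequences in O(T * N^2) time."""
    alpha = pi * gauss(obs[0], means, stds)
    log_lik = np.log(alpha.sum())
    alpha /= alpha.sum()
    for x in obs[1:]:
        alpha = (alpha @ A) * gauss(x, means, stds)
        scale = alpha.sum()              # per-frame rescaling avoids underflow
        log_lik += np.log(scale)
        alpha /= scale
    return log_lik

matched = np.array([0.1, -0.2, 2.9, 3.1, -0.8])    # tracks the state means
mismatched = np.array([8.0, 8.0, 8.0, 8.0, 8.0])
print(log_likelihood(matched) > log_likelihood(mismatched))  # True
```

In recognition, this log-likelihood plays the role of the matching score: the unknown pattern is scored against the HMM of every vocabulary entry, and the scores are passed to the decision stage.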

FIGURE 47.4: Results of time aligning two versions of the word “seven”, showing linear alignment of the two utterances (top panel); optimal time-alignment path (middle panel); and nonlinearly aligned patterns (lower panel)

A major problem in comparing speech patterns is due to speaking rate variations. HMMs provide an implicit time normalization as part of the process of measuring likelihood. For template approaches, however, explicit time normalization is required. Figure 47.4 demonstrates the effect of explicit time normalization between two patterns representing isolated word utterances. The top panel of the figure shows the log energy contours of the two patterns (for the spoken word “seven”), one called the reference (known) pattern and the other called the test (or unknown input) pattern. It can be seen that the inherent durations of the two patterns, 30 and 35 frames (where each frame is a 15-ms segment of speech), are different, and that linear alignment is grossly inadequate for internally aligning events within the two patterns (compare the locations of the vowel peaks in the two patterns). A basic principle of time alignment is to nonuniformly warp the time scale so as to achieve the best possible matching score between the two patterns (regardless of whether the two patterns share the same word identity or not). This can be accomplished by a dynamic programming procedure, often called dynamic time warping (DTW) [12] when applied to speech template matching. The “optimal” nonlinear alignment produced by dynamic time warping is shown at the bottom of Fig. 47.4, in contrast to the linear alignment of the patterns at the top. It is clear that the nonlinear alignment provides a more realistic measure of similarity between the patterns.
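The warping itself is a small dynamic program. The sketch below aligns two feature sequences of unequal length; the Euclidean local distance and the simple three-way step pattern are illustrative choices (practical recognizers add slope constraints and alignment windows, which are omitted here).

```python
import numpy as np

def dtw_distance(test, ref):
    """Dynamic time warping between two feature sequences (one row per
    frame). Returns the accumulated local distance along the optimal
    nonlinear alignment path, length-normalized."""
    d = np.linalg.norm(test[:, None, :] - ref[None, :, :], axis=2)
    T, R = d.shape
    D = np.full((T, R), np.inf)
    D[0, 0] = d[0, 0]
    for i in range(T):
        for j in range(R):
            if i == 0 and j == 0:
                continue
            prev = min(D[i - 1, j] if i > 0 else np.inf,
                       D[i, j - 1] if j > 0 else np.inf,
                       D[i - 1, j - 1] if i > 0 and j > 0 else np.inf)
            D[i, j] = d[i, j] + prev   # cheapest path into cell (i, j)
    return D[-1, -1] / (T + R)

# The same "energy contour" spoken at two rates, vs. a different contour:
t30, t45 = np.linspace(0, 1, 30), np.linspace(0, 1, 45)
fast = np.sin(2 * np.pi * t30)[:, None]
slow = np.sin(2 * np.pi * t45)[:, None]      # same shape, spoken slower
other = np.cos(2 * np.pi * t45)[:, None]     # a genuinely different shape
print(dtw_distance(fast, slow) < dtw_distance(fast, other))  # True
```

Despite the 30-vs-45-frame length mismatch, the warped distance between the two renditions of the same contour stays small, which is exactly the behavior needed for template matching under speaking rate variation.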

47.5.4 Decision Strategy

The decision strategy takes all the matching scores (from the unknown pattern to each of the stored reference patterns) into account, finds the “closest” match, and decides whether the quality of the match is good enough to make a recognition decision. If not, the user is asked to provide another token of the speech (e.g., the word or phrase) for another recognition attempt. This is necessary because the user may often speak words that are incorrect in some sense (e.g., hesitations, incorrectly spoken words, etc.) or simply outside the vocabulary of the recognition system.
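A minimal version of such a strategy might combine an absolute threshold with a margin over the runner-up; both threshold values below are invented for illustration, not taken from the chapter.

```python
def decide(scores, max_cost=5.0, min_margin=1.0):
    """Pick the vocabulary entry with the best (lowest) matching cost, but
    defer (asking the user to repeat) when the best match is poor or is too
    close to the runner-up. Threshold values are illustrative only."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1])
    (best_word, best_cost), (_, second_cost) = ranked[0], ranked[1]
    if best_cost > max_cost:
        return None        # poor absolute match: likely out-of-vocabulary
    if second_cost - best_cost < min_margin:
        return None        # ambiguous: two candidates score nearly alike
    return best_word

print(decide({"yes": 1.2, "no": 4.8, "repeat": 6.0}))  # yes
print(decide({"yes": 3.9, "no": 4.1, "repeat": 6.0}))  # None (ambiguous)
```

Returning no decision, rather than the best guess, is what triggers the "please say that again" re-prompt described above.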

47.5.5 Results of Isolated Word Recognition

Using the pattern recognition model of Fig. 47.2, and using either the nonparametric template approach or the statistical HMM method to derive reference patterns, a wide variety of tests of the recognizer have been performed on telephone speech with isolated word inputs, in both speaker-dependent (SD) and speaker-independent (SI) modes. Vocabulary sizes have ranged from as few as 10 words (i.e., the digits zero through nine) to as many as 1109 words. Table 47.1 summarizes recognizer performance under these conditions.

TABLE 47.1 Performance of Isolated Word Recognizers

    Vocabulary          Mode    Word error rate (%)
    39 Alphadigits      SI      7.0
    129 Airline terms   SI      2.9
    1109 Basic English  SD      4.3

47.6 Connected Word Recognition

The systems we have been describing in previous sections have all been isolated word recognition systems. In this section we consider extensions of the basic processing methods described in previous sections in order to handle recognition of sequences of words: the so-called connected word recognition problem.

The basic approach to connected word recognition is shown in Fig. 47.5. Assume we are given a fluently spoken sequence of words, represented by the (unknown) test pattern T, and we are also given a set of V reference patterns, {R_1, R_2, ..., R_V}, each representing one of the words in the vocabulary. The connected word recognition problem consists of finding the concatenated reference pattern, R_s, which best matches the test pattern, in the sense that the overall similarity between T and R_s is maximum over all sequence lengths and over all combinations of vocabulary words.

FIGURE 47.5: Illustration of the problem of matching a connected word string, spoken fluently, using whole word patterns concatenated together to provide the best match

There are several problems associated with solving the connected word recognition problem as formulated above. First, we do not know how many words were spoken; hence, we have to consider solutions spanning a range of utterance lengths. Second, we do not know, nor can we reliably find, the word boundaries within the test pattern; hence, we cannot use word boundary information to segment the problem into simple “word-matching” recognition problems. Finally, since the combinatorics of solving the problem exhaustively (by trying to match every possible string) are exponential in nature, we need to devise efficient algorithms. Such efficient algorithms have been developed; they solve the connected word recognition problem by iteratively building up time-aligned matches between sequences of reference patterns and the unknown test pattern, one frame at a time [13, 14, 15].
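The iterative build-up can be sketched with a drastically simplified dynamic program. Here "frames" are single characters and the segment matcher is a plain mismatch count; both are stand-ins for the DTW or HMM scoring over real feature frames used by the cited algorithms [13, 14, 15], and the vocabulary patterns are invented.

```python
import math

# Toy connected-word decoder. Vocabulary patterns and the unknown test
# pattern are strings of frame labels (stand-ins for feature vectors).
VOCAB = {"seven": "sssevn", "six": "siks"}

def seg_cost(template, segment):
    # crude stand-in for time-aligned matching: per-frame mismatch count
    if len(template) != len(segment):
        return math.inf
    return sum(a != b for a, b in zip(template, segment))

def decode(test):
    """DP over end frames: best[e] is the cheapest way to explain frames
    [0, e) by some concatenation of vocabulary words, with neither the
    number of words nor the word boundaries known in advance."""
    T = len(test)
    best = [math.inf] * (T + 1)
    back = [None] * (T + 1)
    best[0] = 0.0
    for e in range(1, T + 1):
        for word, tpl in VOCAB.items():
            s = e - len(tpl)
            if s < 0:
                continue
            cost = best[s] + seg_cost(tpl, test[s:e])
            if cost < best[e]:
                best[e], back[e] = cost, (s, word)
    words, e = [], T              # trace back the best word string
    while e > 0 and back[e] is not None:
        s, word = back[e]
        words.append(word)
        e = s
    return list(reversed(words)), best[T]

print(decode("sssevnsiks"))  # (['seven', 'six'], 0.0)
```

The decoder recovers both the word count and the boundaries as byproducts of the optimization, which is precisely why explicit segmentation is unnecessary.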

47.6.1 Performance of Connected Word Recognizers

Typical recognition performance for connected word recognizers is given in Table 47.2 for a range of vocabularies and associated tasks. In the next section we will see how linguistic constraints of the task can be exploited to improve recognition accuracy for word strings beyond the level one would expect on the basis of the word error rates of the system.
