Speech Processing
Richard V. Cox
AT&T Labs — Research
Lawrence R. Rabiner
AT&T Labs — Research
44 Speech Production Models and Their Digital Implementations M. Mohan Sondhi and Juergen Schroeter
Introduction • Geometry of the Vocal and Nasal Tracts • Acoustical Properties of the Vocal and
Nasal Tracts • Sources of Excitation • Digital Implementations
45 Speech Coding Richard V. Cox
Introduction • Useful Models for Speech and Hearing • Types of Speech Coders • Current Standards
46 Text-to-Speech Synthesis Richard Sproat and Joseph Olive
Introduction • Text Analysis and Linguistic Analysis • Speech Synthesis • The Future of TTS
47 Speech Recognition by Machine Lawrence R. Rabiner and Biing-Hwang Juang
Introduction • Characterization of Speech Recognition Systems • Sources of Variability of Speech
• Approaches to ASR by Machine • Speech Recognition by Pattern Matching • Connected Word
Recognition • Continuous Speech Recognition • Speech Recognition System Issues • Practical Issues
in Speech Recognition • ASR Applications
48 Speaker Verification Sadaoki Furui and Aaron E. Rosenberg
Introduction • Personal Identity Characteristics • Vocal Personal Identity Characteristics • Basic
Elements of a Speaker Recognition System • Extracting Speaker Information from the Speech Signal
• Feature Similarity Measurements • Units of Speech for Representing Speakers • Input Modes •
Representations • Optimizing Criteria for Model Construction • Model Training and Updating •
Signal Feature and Score Normalization Techniques • Decision Process • Outstanding Issues
49 DSP Implementations of Speech Processing Kurt Baudendistel
Software Development Targets • Software Development Paradigms • Assembly Language Basics •
Arithmetic • Algorithmic Constructs
50 Software Tools for Speech Research and Development John Shore
Introduction • Historical Highlights • The User’s Environment (OS-Based vs. Workspace-Based)
• Compute-Oriented vs. Display-Oriented • Compiled vs. Interpreted • Specifying Operations
Among Signals • Extensibility (Closed vs. Open Systems) • Consistency Maintenance • Other
Characteristics of Common Approaches • File Formats (Data Import/Export) • Speech Databases
• Summary of Characteristics and Uses • Sources for Finding Out What is Currently Available •
Future Trends
WITH THE ADVENT OF CHEAP, HIGH-SPEED PROCESSORS, and with the ever-decreasing cost of memory, the cost of speech processing has been driven down to the point where it can be (and has been) embedded in almost any system, from a low-cost consumer product (e.g., solid-state digital answering machines, voice-controlled telephones), to a desktop application (e.g., voice dictation of a first-draft-quality manuscript), to an application embedded in a voice or data network (e.g., voice dialing, packet telephony, voice browser for the Internet). The purpose of this section of the Handbook is to discuss several of the key technologies in speech processing and to illustrate how these technologies are implemented using special-purpose DSP chips or via standard software packages running on more conventional processors.
The broad area of speech processing can be broken down into several individual areas according to both applications and technology. These include:
1. Speech Production Models and their Digital Implementations (see Chapter 44 by Sondhi and Schroeter). To understand how the characteristics of a speech signal can be exploited in the different application areas, it is necessary to understand the properties and constraints of the human vocal apparatus, that is, how speech is generated by humans. It is also necessary to understand how models that simulate speech production can be built and how they can be implemented as digital systems, since such models form the basis for almost all practical speech processing systems.
2. Speech Coding (see Chapter 45 by Cox). Speech coding is the process of compressing the information in a speech signal so that it can be either transmitted over a channel whose bandwidth is significantly smaller than that of the uncompressed signal or stored economically. Speech coding is used as the basis for most modern voice messaging and voice mail systems, for voice response systems, for digital cellular and satellite transmission of speech, for packet telephony, for ISDN teleconferencing, and for digital answering machines and digital voice encryption machines.
3. Text-to-Speech Synthesis (see Chapter 46 by Sproat and Olive). Speech synthesis is the process of creating a synthetic replica of a speech signal so as to transmit a message from a machine to a person, with the purpose of conveying the information in the message. Speech synthesis is often called “text-to-speech” or TTS, to convey the idea that, in general, the input to the system is ordinary ASCII text and the output of the system is ordinary speech. The goal of most speech synthesis systems is to provide a broad range of capability for having a machine speak information (stored in the machine) to a user. Key aspects of synthesis systems are the intelligibility and the naturalness of the resulting speech. The major applications of speech synthesis include acting as a voice server for text-based information services (e.g., stock prices, sports scores, flight information); providing a means for reading e-mail, or the text portions of FAX messages, over ordinary phone lines; providing a means for previewing text stored in documents (e.g., document drafts, Internet files); and serving as a voice readout for handheld devices (e.g., phrase book translators, dictionaries).
4. Speech Recognition by Machine (see Chapter 47 by Rabiner and Juang). Speech recognition is the process of extracting the message information in a speech signal so as to control the action of a machine in response to spoken commands. In a sense, speech recognition is the complementary process to speech synthesis, and together they constitute the building blocks of a voice dialogue system with a machine. Many factors influence the type of speech recognition system that is used for a given application, including the mode of speaking to the machine (e.g., single commands, digit sequences, fluent sentences), the size and complexity of the vocabulary that the machine understands, the task that the machine is asked to accomplish, the environment in which the recognition system must run, and finally the cost of the system. Although there is a wide range of applications of speech recognition systems, the most generic systems are simple “command-and-control” systems (with menu-like interfaces), and the most advanced systems support full voice dialogues for dictation, forms entry, catalog ordering, reservation services, etc.
5. Speaker Verification (see Chapter 48 by Furui and Rosenberg). Speaker verification is the process of verifying the claimed identity of a speaker for the purpose of restricting access to information (e.g., personal or private records), networks (computer, PBX), or physical premises. The basic problem of speaker verification is to decide whether or not an unknown speech sample was spoken by the individual whose identity was claimed. A key requirement of any speaker verification system is to accept the true speaker as often as possible while rejecting the impostor as often as possible. Since these are inherently conflicting goals, all practical systems arrive at some compromise between the levels of these two types of system error (rejecting the true speaker and accepting an impostor). The major area of application for speaker verification is access control to information, credit, banking, machines, computer networks, private branch exchanges (PBXs), and even premises. The concept of a “voice lock” that prevents access until the appropriate speech by the authorized individual(s) (e.g., “Open Sesame”) is “heard” by the system is made a reality using speaker verification technology.
6. DSP Implementations of Speech Processing (see Chapter 49 by Baudendistel). Until a few years ago, almost all speech processing systems were implemented on low-cost fixed-point DSP processors because of their high efficiency in realizing the computational aspects of the various signal processing algorithms. A key problem in realizing any digital system in integer DSP code is how to map an algorithm, which typically runs as floating-point C code on a workstation, efficiently (in both time and space) to integer C code that takes advantage of the unique characteristics of different DSP chips. Furthermore, because of the rate of change of technology, it is essential that the conversion to DSP code occur rapidly (e.g., on the order of 3 person-months); otherwise, by the time a given algorithm is mapped to a specific DSP processor, a new (faster, cheaper) generation of DSP chips will have evolved, making the entire effort obsolete.
7. Software Tools for Speech Research and Development (see Chapter 50 by Shore). The field of speech processing has become a complex one, in which an investigator needs a broad range of tools to record, digitize, display, manipulate, process, store, format, analyze, and listen to speech in its different file forms and manifestations. Although it is conceivable that an individual could create a suite of software tools for an individual application, that process would be highly inefficient and would undoubtedly result in tools that were significantly less powerful than those developed in the commercial sector, such as the Entropic Signal Processing System, MATLAB, Waves, Interactive Laboratory System (ILS), or the commercial packages for TTS and speech recognition such as the Hidden Markov Model Toolkit (HTK).
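The compromise between the two error types described under speaker verification (item 5) can be illustrated with a small sketch. It assumes, purely for illustration, that the system produces a similarity score for each access attempt and accepts the claim when the score reaches a threshold; the scores, the threshold values, and the function names below are hypothetical and are not taken from Chapter 48.

```python
# Hypothetical verification scores (higher = more like the claimed speaker).
# These numbers are invented for illustration only.
TRUE_SPEAKER_SCORES = [0.91, 0.84, 0.77, 0.69, 0.62, 0.58]
IMPOSTOR_SCORES = [0.66, 0.55, 0.47, 0.39, 0.31, 0.20]


def error_rates(threshold, true_scores, impostor_scores):
    """Return (false_rejection_rate, false_acceptance_rate) at a threshold.

    A claim is accepted when its score is >= threshold.
    """
    false_rejections = sum(1 for s in true_scores if s < threshold)
    false_acceptances = sum(1 for s in impostor_scores if s >= threshold)
    return (false_rejections / len(true_scores),
            false_acceptances / len(impostor_scores))


def sweep(thresholds, true_scores, impostor_scores):
    """Tabulate both error rates over a list of candidate thresholds."""
    return [(t, *error_rates(t, true_scores, impostor_scores))
            for t in thresholds]


if __name__ == "__main__":
    rows = sweep([0.3, 0.5, 0.6, 0.7, 0.9],
                 TRUE_SPEAKER_SCORES, IMPOSTOR_SCORES)
    for t, frr, far in rows:
        print(f"threshold={t:.1f}  false rejection={frr:.2f}  "
              f"false acceptance={far:.2f}")
```

Raising the threshold lowers the false acceptance rate and raises the false rejection rate, and lowering it does the opposite; choosing an operating point between these extremes is the compromise referred to above.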
The material presented in this section should provide the reader with a framework for understanding the signal processing aspects of speech processing and some pointers into the literature for further investigation of this fascinating and rapidly evolving field.
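As a concrete illustration of the floating-point to integer mapping problem raised under DSP implementations (item 6), the following minimal sketch emulates 16-bit fixed-point arithmetic in Python. It assumes the common Q15 convention (one sign bit, 15 fractional bits) with saturation and a wide accumulator; the choice of Q15 and all function names are assumptions made for illustration, not the method of Chapter 49 or of any particular DSP chip.

```python
Q15_SCALE = 1 << 15        # 32768: one sign bit, 15 fractional bits
Q15_MAX = Q15_SCALE - 1    # largest 16-bit signed value, 32767
Q15_MIN = -Q15_SCALE       # smallest 16-bit signed value, -32768


def saturate(value):
    """Clamp an integer to the 16-bit signed range instead of wrapping."""
    return max(Q15_MIN, min(Q15_MAX, value))


def float_to_q15(x):
    """Quantize a float in roughly [-1, 1) to a Q15 integer."""
    return saturate(int(round(x * Q15_SCALE)))


def q15_to_float(q):
    """Convert a Q15 integer back to a float."""
    return q / Q15_SCALE


def q15_mul(a, b):
    """Multiply two Q15 values: the Q30 product is rounded back to Q15."""
    product = a * b
    return saturate((product + (1 << 14)) >> 15)


def dot_q15(coeffs_q15, samples_q15):
    """Dot product using a wide accumulator, rounded and scaled once."""
    acc = sum(c * s for c, s in zip(coeffs_q15, samples_q15))  # Q30 sum
    return saturate((acc + (1 << 14)) >> 15)


if __name__ == "__main__":
    coeffs = [float_to_q15(c) for c in (0.25, 0.5, 0.25)]
    samples = [float_to_q15(s) for s in (0.5, 0.5, 0.5)]
    print(q15_to_float(dot_q15(coeffs, samples)))
```

Accumulating the full-precision products and rounding once at the end, rather than rounding each product, is one typical design choice in such conversions, because it limits the quantization error introduced per output sample.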