Speech Processing

Richard V. Cox

AT&T Labs — Research

Lawrence R. Rabiner

AT&T Labs — Research

44 Speech Production Models and their Digital Implementations M. Mohan Sondhi and Juergen Schroeter

Introduction • Geometry of the Vocal and Nasal Tracts • Acoustical Properties of the Vocal and Nasal Tracts • Sources of Excitation • Digital Implementations

45 Speech Coding Richard V. Cox

Introduction • Useful Models for Speech and Hearing • Types of Speech Coders • Current Standards

46 Text-to-Speech Synthesis Richard Sproat and Joseph Olive

Introduction • Text Analysis and Linguistic Analysis • Speech Synthesis • The Future of TTS

47 Speech Recognition by Machine Lawrence R. Rabiner and B. H. Juang

Introduction • Characterization of Speech Recognition Systems • Sources of Variability of Speech • Approaches to ASR by Machine • Speech Recognition by Pattern Matching • Connected Word Recognition • Continuous Speech Recognition • Speech Recognition System Issues • Practical Issues in Speech Recognition • ASR Applications

48 Speaker Verification Sadaoki Furui and Aaron E. Rosenberg

Introduction • Personal Identity Characteristics • Vocal Personal Identity Characteristics • Basic Elements of a Speaker Recognition System • Extracting Speaker Information from the Speech Signal • Feature Similarity Measurements • Units of Speech for Representing Speakers • Input Modes • Representations • Optimizing Criteria for Model Construction • Model Training and Updating • Signal Feature and Score Normalization Techniques • Decision Process • Outstanding Issues

49 DSP Implementations of Speech Processing Kurt Baudendistel

Software Development Targets • Software Development Paradigms • Assembly Language Basics • Arithmetic • Algorithmic Constructs

50 Software Tools for Speech Research and Development John Shore

Introduction • Historical Highlights • The User’s Environment (OS-Based vs. Workspace-Based) • Compute-Oriented vs. Display-Oriented • Compiled vs. Interpreted • Specifying Operations Among Signals • Extensibility (Closed vs. Open Systems) • Consistency Maintenance • Other Characteristics of Common Approaches • File Formats (Data Import/Export) • Speech Databases • Summary of Characteristics and Uses • Sources for Finding Out What is Currently Available • Future Trends

With the advent of cheap, high-speed processors, and with the ever-decreasing cost of memory, the cost of speech processing has been driven down to the point where it can be (and has been) embedded in almost any system, from a low-cost consumer product (e.g., solid-state digital answering machines, voice-controlled telephones, etc.), to a desktop application (e.g., voice dictation of a first-draft quality manuscript), to an application embedded in a voice or data network (e.g., voice dialing, packet telephony, voice browser for the Internet, etc.). It is the purpose of this section of the Handbook to provide discussions of several of the key technologies in speech processing and to illustrate how the technologies are implemented using special-purpose DSP processor chips or via standard software packages running on more conventional processors.

The broad area of speech processing can be broken down into several individual areas according to both applications and technology. These include:

1. Speech Production Models and their Digital Implementations (see Chapter 44 by Sondhi and Schroeter). In order to understand how the characteristics of a speech signal can be exploited in the different application areas, it is necessary to understand the properties and constraints of the human vocal apparatus (that is, to understand how speech is generated by humans). It is also necessary to understand the way in which models can be built that simulate speech production, as well as the ways in which they can be implemented as digital systems, since such models form the basis for almost all practical speech processing systems.

2. Speech Coding (see Chapter 45 by Cox). Speech coding is the process of compressing the information in a speech signal so that it can be stored economically, or transmitted over a channel whose bandwidth is significantly smaller than that of the uncompressed signal. Speech coding is used as the basis for most modern voice messaging and voice mail systems, for voice response systems, for digital cellular and satellite transmission of speech, for packet telephony, for ISDN teleconferencing, and for digital answering machines and digital voice encryption machines.

3. Text-to-Speech Synthesis (see Chapter 46 by Sproat and Olive). Speech synthesis is the process of creating a synthetic replica of a speech signal so as to transmit a message from a machine to a person, with the purpose of conveying the information in the message. Speech synthesis is often called “text-to-speech” or TTS, to convey the idea that, in general, the input to the system is ordinary ASCII text, and the output of the system is ordinary speech. The goal of most speech synthesis systems is to provide a broad range of capability for having a machine speak information (stored in the machine) to a user. Key aspects of synthesis systems are the intelligibility and the naturalness of the resulting speech. The major applications of speech synthesis include acting as a voice server for text-based information services (e.g., stock prices, sports scores, flight information); providing a means for reading e-mail, or the text portions of FAX messages, over ordinary phone lines; providing a means for previewing text stored in documents (e.g., document drafts, Internet files); and finally as a voice readout for handheld devices (e.g., phrase book translators, dictionaries, etc.).

4. Speech Recognition by Machine (see Chapter 47 by Rabiner and Juang). Speech recognition is the process of extracting the message information in a speech signal so as to control the action of a machine in response to spoken commands. In a sense, speech recognition is the complementary process to speech synthesis, and together they constitute the building blocks of a voice dialogue system with a machine. There are many factors which influence the type of speech recognition system that is used for different applications, including the mode of speaking to the machine (e.g., single commands, digit sequences, fluent sentences), the size and complexity of the vocabulary which the machine understands, the task which the machine is asked to accomplish, the environment in which the recognition system must run, and finally the cost of the system. Although there is a wide range of applications of speech recognition systems, the most generic systems are simple “command-and-control” systems (with menu-like interfaces), and the most advanced systems support full voice dialogues for dictation, forms entry, catalog ordering, reservation services, etc.

5. Speaker Verification (see Chapter 48 by Furui and Rosenberg). Speaker verification is the process of verifying the claimed identity of a speaker for the purpose of restricting access to information (e.g., personal or private records), networks (computer, PBX), or physical premises. The basic problem of speaker verification is to decide whether or not an unknown speech sample was spoken by the individual whose identity was claimed. A key aspect of any speaker verification system is to accept the true speaker as often as possible while rejecting the impostor as often as possible. Since these are inherently conflicting goals, all practical systems arrive at some compromise between the levels of the two corresponding types of system error: rejecting the true speaker and accepting an impostor. The major area of application for speaker verification is in access control to information, credit, banking, machines, computer networks, private branch exchanges (PBXs), and even premises. The concept of a “voice lock” that prevents access until the appropriate speech by the authorized individual(s) (e.g., “Open Sesame”) is “heard” by the system is made a reality using speaker verification technology.

6. DSP Implementations of Speech Processing (see Chapter 49 by Baudendistel). Until a few years ago, almost all speech processing systems were implemented on low-cost fixed-point DSP processors because of their high efficiency in realizing the computational aspects of the various signal processing algorithms. A key problem in the realization of any digital system in integer DSP code is how to map an algorithm, which is typically running in floating-point C code on a workstation, efficiently (in both time and space) to integer C code that takes advantage of the unique characteristics of different DSP chips. Furthermore, because of the rate of change of technology, it is essential that the conversion to DSP code occur rapidly (e.g., on the order of three person-months), or else by the time a given algorithm is mapped to a specific DSP processor, a new (faster, cheaper) generation of DSP chips will have evolved, obsoleting the entire process.

7. Software Tools for Speech Research and Development (see Chapter 50 by Shore). The field of speech processing has become a complex one, where an investigator needs a broad range of tools to record, digitize, display, manipulate, process, store, format, analyze, and listen to speech in its different file forms and manifestations. Although it is conceivable that an individual could create a suite of software tools for an individual application, that process would be highly inefficient and would undoubtedly result in tools which were significantly less powerful than those developed in the commercial sector, such as the Entropic Signal Processing System, MATLAB, Waves, Interactive Laboratory System (ILS), or the commercial packages for TTS and speech recognition such as the Hidden Markov Model Toolkit (HTK).
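The compromise described under speaker verification (item 5) can be made concrete with a small sketch. It assumes, purely for illustration, that the system produces a similarity score for each utterance and accepts the claimed identity when the score reaches a threshold. The scores, the threshold values, and the function name `error_rates` are hypothetical and are not taken from the chapter.

```python
from typing import Sequence, Tuple

# Hypothetical similarity scores (higher means "more like the claimed speaker").
TRUE_SPEAKER_SCORES = [0.90, 0.80, 0.75, 0.60, 0.55]
IMPOSTOR_SCORES = [0.70, 0.50, 0.40, 0.30, 0.20]


def error_rates(true_scores: Sequence[float],
                impostor_scores: Sequence[float],
                threshold: float) -> Tuple[float, float]:
    """Return (false_rejection_rate, false_acceptance_rate) for one threshold.

    A claim is accepted when its score is greater than or equal to the threshold.
    """
    # True speakers whose score falls below the threshold are wrongly rejected.
    false_rejections = sum(1 for s in true_scores if s < threshold)
    # Impostors whose score reaches the threshold are wrongly accepted.
    false_acceptances = sum(1 for s in impostor_scores if s >= threshold)
    return (false_rejections / len(true_scores),
            false_acceptances / len(impostor_scores))


if __name__ == "__main__":
    # Raising the threshold lowers false acceptances but raises false rejections.
    for threshold in (0.45, 0.65, 0.85):
        fr, fa = error_rates(TRUE_SPEAKER_SCORES, IMPOSTOR_SCORES, threshold)
        print(f"threshold={threshold:.2f}  false reject={fr:.2f}  false accept={fa:.2f}")
```

With these made-up scores, no single threshold drives both error rates to zero, which is the conflict the text refers to. A practical system would choose its operating point according to the relative cost of the two errors in its application.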

The material presented in this section should provide the reader with a framework for understanding the signal processing aspects of speech processing and some pointers into the literature for further investigation of this fascinating and rapidly evolving field.
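As a small illustration of the floating-point to integer mapping mentioned under DSP implementations (item 6), the sketch below uses the common Q15 convention, in which a 16-bit integer represents a value in the range [-1, 1). The choice of Q15, the rounding rule, and the helper names are assumptions made for this example. The section does not prescribe a particular format, and Python is used here only to show the arithmetic that integer DSP code would perform.

```python
Q15_SHIFT = 15
Q15_SCALE = 1 << Q15_SHIFT   # 32768
Q15_MAX = Q15_SCALE - 1      # 32767, largest representable value
Q15_MIN = -Q15_SCALE         # -32768, smallest representable value


def saturate(value: int) -> int:
    """Clamp an integer to the 16-bit Q15 range instead of letting it wrap."""
    return max(Q15_MIN, min(Q15_MAX, value))


def float_to_q15(x: float) -> int:
    """Quantize a float in roughly [-1, 1) to a saturated Q15 integer."""
    return saturate(int(round(x * Q15_SCALE)))


def q15_to_float(q: int) -> float:
    """Convert a Q15 integer back to a float for inspection."""
    return q / Q15_SCALE


def q15_mul(a: int, b: int) -> int:
    """Multiply two Q15 values: full-width product, round, shift, saturate."""
    product = a * b                              # Q30 result in a wider register
    rounded = product + (1 << (Q15_SHIFT - 1))   # add half an LSB before shifting
    return saturate(rounded >> Q15_SHIFT)


if __name__ == "__main__":
    half = float_to_q15(0.5)
    print(half, q15_to_float(q15_mul(half, half)))   # 0.5 * 0.5 in fixed point
    # -1.0 * -1.0 would be +1.0, which Q15 cannot represent, so it saturates.
    print(q15_mul(Q15_MIN, Q15_MIN))
```

The saturation step is the kind of detail that has no counterpart in the floating-point version of an algorithm, which is one reason the conversion takes engineering effort.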

Title	Speech Processing
Authors	Richard V. Cox, Lawrence R. Rabiner, Juergen Schroeter, M. Mohan Sondhi, Richard Sproat, Joseph Olive, B. H. Juang, Sadaoki Furui, Aaron E. Rosenberg, Kurt Baudendistel, John Shore
Institution	AT&T Labs — Research
Field	Speech Processing
Type	Document
Year of publication	1999
City	New Jersey
