Shore, J. "Software Tools for Speech Research and Development". Digital Signal Processing Handbook. Ed. Vijay K. Madisetti and Douglas B. Williams. Boca Raton: CRC Press LLC, 1999.
Software Tools for Speech Research and Development

John Shore
Entropic Research Laboratory, Inc.
50.1 Introduction
50.2 Historical Highlights
50.3 The User's Environment (OS-Based vs. Workspace-Based)
    Operating-System-Based Environment • Workspace-Based Environment
50.4 Compute-Oriented vs. Display-Oriented
    Compute-Oriented Software • Display-Oriented Software • Hybrid Compute/Display-Oriented Software
50.5 Compiled vs. Interpreted
    Interpreted Software • Compiled Software • Hybrid Interpreted/Compiled Software • Computation vs. Display
50.6 Specifying Operations Among Signals
    Text-Based Interfaces • Visual ("Point-and-Click") Interfaces • Parametric Control of Operations
50.7 Extensibility (Closed vs. Open Systems)
50.8 Consistency Maintenance
50.9 Other Characteristics of Common Approaches
    Memory-Based vs. File-Based • Documentation of Processing History • Personalization • Real-Time Performance • Source Availability • Hardware Requirements • Cross-Platform Compatibility • Degree of Specialization • Support for Speech Input and Output
50.10 File Formats (Data Import/Export)
50.11 Speech Databases
50.12 Summary of Characteristics and Uses
50.13 Sources for Finding Out What is Currently Available
50.14 Future Trends
References
50.1 Introduction
Experts in every field of study depend on specialized tools. In the case of speech research and development, the dominant tools today are computer programs. In this article, we present an overview of key technical approaches and features that are prevalent today.
We restrict the discussion to software intended to support R&D, as opposed to software for commercial applications of speech processing. For example, we ignore DSP programming (which is discussed in the previous article). Also, we concentrate on software intended to support the specialties of speech analysis, coding, synthesis, and recognition, since these are the main subjects of this chapter. However, much of what we have to say applies as well to the needs of those in such closely related areas as psycho-acoustics, clinical voice analysis, sound and vibration, etc.
We do not attempt to survey available software packages, as the result would likely be obsolete by the time this book is printed. The examples mentioned are illustrative, and not intended to provide a thorough or balanced review. Our aim is to provide sufficient background so that readers can assess their needs and understand the differences among available tools. Up-to-date surveys are readily available online (see Section 50.13).
In general, there are three common uses of speech R&D software:

• Teaching, e.g., homework assignments for a basic course in speech processing
• Interactive, free-form exploration, e.g., designing a filter and evaluating its effects on a speech processing system
• Batch experiments, e.g., training and testing speech coders or speech recognizers using a large database
The relative importance of various features differs among these uses. For example, in conducting batch experiments, it is important that large signals can be handled, and that complicated algorithms execute efficiently. For teaching, on the other hand, these features are less important than simplicity, quick experimentation, and ease of use. Because of practical limitations, such differences in priority mean that no one software package today can meet all needs.
To explain the variation among current approaches, we identify a number of distinguishing characteristics. These characteristics are not independent (i.e., there is considerable overlap), but they do help to present the overall view.

For simplicity, we will refer to any particular speech R&D software as "the speech software".
50.2 Historical Highlights
Early or significant examples of speech R&D software include "Visible Speech" [5], MITSYN [1], and Lloyd Rice's WAVE program of the mid-1970s (not to be confused with David Talkin's waves [8]). The first general, commercial system that achieved widespread acceptance was the Interactive Laboratory System (ILS) from Signal Technology Incorporated, which was popular in the late 1970s and early 1980s. Using the terminology defined below, ILS is compute-oriented software with an operating-system-based environment. The first popular, display-oriented, workspace-based speech software was David Shipman's LISP-machine application called Spire [6].
50.3 The User's Environment (OS-Based vs. Workspace-Based)
In some cases, the user sees the speech software as an extension of the computer's operating system. We call this "operating-system-based" (or OS-based); an example is the Entropic Signal Processing System (ESPS) [7].

In other cases, the software provides its own operating environment. We call this "workspace-based" (from the term used in implementations of the programming language APL); an example is MATLAB™ (from The MathWorks).
50.3.1 Operating-System-Based Environment
In this approach, signals are represented as files under the native operating system (e.g., Unix, DOS), and the software consists of a set of programs that can be invoked separately to process or display signals in various ways. Thus, the user sees the software as an extension of an already-familiar operating system. Because signals are represented as files, the speech software inherits file manipulation capabilities from the operating system. Under Unix, for example, signals can be copied and moved using the cp and mv programs, respectively, and they can be organized as directory trees in the Unix hierarchical file system (including NFS).
Similarly, the speech software inherits extension capabilities inherent in the operating system. Under Unix, for example, extensions can be created using shell scripts in various languages (sh, csh, Tcl, perl, etc.), as well as such facilities as pipes and remote execution. OS-based speech software packages are often called command-line packages because usage typically involves providing a sequence of commands to some type of shell.
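As a concrete sketch of this command-line style, the snippet below composes speech tools into a Unix-style pipeline. The program names and flags (filter, spectrum, and their options) are invented for illustration; they are not the actual commands of ESPS or any other package.

```python
# Hedged sketch: composing hypothetical OS-based, command-line speech tools
# into a processing chain, in the style of a Unix shell pipeline.

def pipeline(*stages):
    """Join command-line stages with Unix pipes, as a shell would."""
    return " | ".join(stages)

cmd = pipeline(
    "filter -type lowpass -cutoff 3400 speech.sd",  # hypothetical filter tool
    "spectrum -order 512 -",                        # hypothetical analysis tool
)
print(cmd)
```

A string like this would then be handed to a shell (or each stage spawned with a pipe between them); the point is that the chain is expressed with ordinary OS facilities rather than inside a closed application.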
50.3.2 Workspace-Based Environment
In this approach, the user interacts with a single application program that takes over from the operating system. Signals, which may or may not correspond to files, are typically represented as variables in some kind of virtual space. Various commands are available to process or display the signals. Such a workspace is often analogous to a personal blackboard.

Workspace-based systems usually offer means for saving the current workspace contents and for loading previously saved workspaces.

An extension mechanism is typically provided by a command interpreter for a simple language that includes the available operations and a means for encapsulating and invoking command sequences (e.g., in a function or procedure definition). In effect, the speech software provides its own shell to the user.
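The workspace idea can be sketched as follows. This is an illustrative toy, assuming signals are simple lists of samples; it is not the mechanism of MATLAB™ or any particular package.

```python
# Minimal sketch of a workspace-based environment: signals live as named
# variables in a workspace, which can be saved and reloaded across sessions.
import pickle

class Workspace:
    def __init__(self):
        self.signals = {}          # name -> list of samples

    def let(self, name, samples):  # bind a signal variable in the workspace
        self.signals[name] = list(samples)

    def save(self, path):          # persist the whole workspace to disk
        with open(path, "wb") as f:
            pickle.dump(self.signals, f)

    def load(self, path):          # resume a previously saved session
        with open(path, "rb") as f:
            self.signals = pickle.load(f)

ws = Workspace()
ws.let("x", [0.0, 0.5, 1.0])
ws.save("session.pkl")

ws2 = Workspace()                  # a later session
ws2.load("session.pkl")
print(ws2.signals["x"])            # the saved signal variable is back
```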
50.4 Compute-Oriented vs. Display-Oriented
This distinction concerns whether the speech software emphasizes computation, visualization, or both.
50.4.1 Compute-Oriented Software
If there is a large number of signal processing operations relative to the number of signal display operations, we say that the software is compute-oriented. Such software typically can be operated without a display device, and the user thinks of it primarily as a computation package that supports such functions as spectral analysis, filtering, linear prediction, quantization, analysis/synthesis, pattern classification, Hidden Markov Model (HMM) training, speech recognition, etc.
Compute-oriented software can be either OS-based or workspace-based. Examples include ESPS, MATLAB™, and the Hidden Markov Model Toolkit (HTK) (from Cambridge University and Entropic).
50.4.2 Display-Oriented Software
In contrast, display-oriented speech software is not intended to, and often cannot, operate without a display device. The primary purpose is to support visual inspection of waveforms, spectrograms, and other parametric representations. The user typically interacts with the software using a mouse or other pointing device to initiate display operations such as scrolling, zooming, enlarging, etc.

While the software may also provide computations that can be performed on displayed signals (or marked segments of displayed signals), the user thinks of the software as supporting visualization more than computation. An example is the waves program [8].
50.4.3 Hybrid Compute/Display-Oriented Software
Hybrid compute/display software combines the best of both. Interactions are typically by means of a display device, but computational capabilities are rich. The computational capabilities may be built in to workspace-based speech software, or may be OS-based but accessible from the display program. Examples include the Computerized Speech Lab (CSL) from Kay Elemetrics Corp., and the combination of ESPS and waves.
50.5 Compiled vs. Interpreted
Here we distinguish according to whether the bulk of the signal processing or display code (whether written by developers or users) is interpreted or compiled.
50.5.1 Interpreted Software
The interpreter language may be specially designed for the software (e.g., S-PLUS from Statistical Sciences, Inc., and MATLAB™), or may be an existing, general-purpose language (e.g., LISP is used in N!Power from Signal Technology, Inc.).

Compared to compiler languages, interpreter languages tend to be simpler and easier to learn. Furthermore, it is usually easier and faster to write and test programs under an interpreter. The disadvantage, relative to compiled languages, is that the resulting programs can be quite slow to run. As a result, interpreted speech software is usually better suited for teaching and interactive exploration than for batch experiments.
50.5.2 Compiled Software
Compared to interpreted languages, compiled languages (e.g., FORTRAN, C, C++) tend to be more complicated and harder to learn. Compared to interpreted programs, compiled programs are slower to write and test, but considerably faster to run. As a result, compiled speech software is usually better suited for batch experiments than for teaching.
50.5.3 Hybrid Interpreted/Compiled Software
Some interpreters make it possible to create new language commands with an underlying implementation that is compiled. This allows a hybrid approach that can combine the best of both.

Some languages provide a hybrid approach in which the source code is pre-compiled quickly into intermediate code that is then (usually!) interpreted. Java is a good example.
If compiled speech software is OS-based, signal processing scripts can typically be written in an interpretive language (e.g., a sh script containing a sequence of calls to ESPS programs). Thus, hybrid systems can also be based on compiled software.
50.5.4 Computation vs. Display
The distinction between compiled and interpreted languages is relevant mostly to the computational aspects of the speech software. However, the distinction can apply as well to display software, since some display programs are compiled (e.g., using Motif) while others exploit interpreters (e.g., Tcl/Tk, Java).
50.6 Specifying Operations Among Signals
Here we are concerned with the means by which users specify what operations are to be done and on what signals. This consideration is relevant to how speech software can be extended with user-defined operations (see Section 50.7), but is an issue even in software that is not extensible.

The main distinction is between a text-based interface and a visual ("point-and-click") interface. Visual interfaces tend to be less general but easier to use.
50.6.1 Text-Based Interfaces
Traditional interfaces for specifying computations are based on a textual representation in the form of scripts and programs. For OS-based speech software, operations are typically specified by typing the name of a command (with possible options) directly to a shell. One can also enter a sequence of such commands into a text editor when preparing a script.

This style of specifying operations also is available for workspace-based speech software that is based on a command interpreter. In this case, the text comprises legal commands and programs in the interpreter language.

Both OS-based and workspace-based speech software may also permit the specification of operations using source code in a high-level language (e.g., C) that gets compiled.
50.6.2 Visual (“Point-and-Click”) Interfaces
The point-and-click approach has become the ubiquitous user interface of the 1990s. Operations and operands (signals) are specified by using a mouse or other pointing device to interact with on-screen graphical user interface (GUI) controls such as buttons and menus. The interface may also have a text-based component to allow the direct entry of parameter values or formulas relating signals.
Visual Interfaces for Display-Oriented Software
In display-oriented software, the signals on which operations are to be performed are visible as waveforms or other directly representative graphics.

A typical user interaction proceeds as follows: A relevant signal is specified by a mouse-click operation (if a signal segment is involved, it is selected by a click-and-drag operation or by a pair of mouse-click operations). The operation to be performed is then specified by mouse-click operations on screen buttons, pull-down menus, or pop-up menus.
This style works very well for unary operations (e.g., compute and display the spectrogram of a given signal segment), and moderately well for binary operations (e.g., add two signals). But it is awkward for operations that have more than two inputs. It is also awkward for specifying chained calculations, especially if you want to repeat the calculations for a new set of signals.

One solution to these problems is provided by a "calculator-style" interface that looks and acts like a familiar arithmetic calculator (except the operands are signal names and the operations are signal processing operations).
Another solution is the "spreadsheet-style" interface. The analogy with spreadsheets is tight. Imagine a spreadsheet in which the cells are replaced by images (waveforms, spectrograms, etc.) connected logically by formulas. For example, one cell might show a test signal, a second might show the results of filtering it, and a third might show a spectrogram of a portion of the filtered signal. This exemplifies a spreadsheet-style interface for speech software.
A spreadsheet-style interface provides some means for specifying the "formulas" that relate the various "cells". This formula interface might itself be implemented in a point-and-click fashion, or it might permit direct entry of formulas in some interpretive language. Speech software with a spreadsheet-style interface will maintain consistency among the visible signals. Thus, if one of the signals is edited or replaced, the other signal graphics change correspondingly, according to the underlying formulas.

DADiSP (from DSP Development Corporation) is an example of a spreadsheet-style interface.
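The formula-driven consistency of a spreadsheet-style interface can be sketched in a few lines. This toy (its classes and cell names are invented for illustration) treats each "cell" as either a raw signal or a formula over other cells, so dependent cells always reflect the current sources.

```python
# Toy sketch of spreadsheet-style consistency maintenance: cells hold either
# raw signals or formulas over other cells, and values are recomputed
# through the formulas so that dependents track every source change.

class Sheet:
    def __init__(self):
        self.raw = {}        # cell name -> raw signal (list of samples)
        self.formulas = {}   # cell name -> (function, source cell names)

    def set_signal(self, name, samples):
        self.raw[name] = list(samples)

    def set_formula(self, name, func, *sources):
        self.formulas[name] = (func, sources)

    def value(self, name):
        # recompute through formulas so cells stay consistent with sources
        if name in self.raw:
            return self.raw[name]
        func, sources = self.formulas[name]
        return func(*(self.value(s) for s in sources))

sheet = Sheet()
sheet.set_signal("test", [1.0, -2.0, 3.0])
sheet.set_formula("rectified", lambda x: [abs(v) for v in x], "test")

print(sheet.value("rectified"))   # [1.0, 2.0, 3.0]
sheet.set_signal("test", [-5.0, 5.0])
print(sheet.value("rectified"))   # [5.0, 5.0] -- tracks the edited source
```

Replacing the "test" cell automatically changes what "rectified" shows, which is exactly the consistency behavior described above.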
Visual Interfaces for Compute-Oriented Software
In a visual interface for display-oriented software, the focus is on the signals themselves. In a visual interface for compute-oriented software, on the other hand, the focus is on the operations. Operations among signals typically are represented as icons with one or more input and output lines that interconnect the operations. In effect, the representation of a signal is reduced to a straight line indicating its relationship (input or output) with respect to operations. Such visual interfaces are often called block-diagram interfaces. A block-diagram interface thus provides a visual representation of the computation chain. Various point-and-click means are provided to support the user in creating, examining, and modifying block diagrams.

Ptolemy [4] and N!Power are examples of systems that provide a block-diagram interface.
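The dataflow idea underlying a block-diagram interface can be sketched as follows. The block names and the tiny chain are invented for illustration; real systems such as Ptolemy are far more elaborate.

```python
# Illustrative sketch of a block-diagram computation: operations are nodes,
# and signals are just the lines connecting one block's output to another's
# input.  Running the final block pulls data through the whole chain.

class Block:
    def __init__(self, name, func, *inputs):
        self.name, self.func, self.inputs = name, func, inputs

    def run(self):
        # evaluate upstream blocks first, then apply this block's operation
        args = [b.run() for b in self.inputs]
        return self.func(*args)

# a tiny chain: source -> gain -> offset
src = Block("source", lambda: [1.0, 2.0, 3.0])
gain = Block("gain2", lambda x: [2 * v for v in x], src)
offset = Block("plus1", lambda x: [v + 1 for v in x], gain)

print(offset.run())   # [3.0, 5.0, 7.0]
```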
Limitations of Visual Interfaces
Although much in vogue, visual interfaces are inherently limited as a means for specifying signal computations.
For example, the analogy between spreadsheets and spreadsheet-style speech software continues. For simple signal computations, the spreadsheet-style interface can be very useful; computations are simple to set up and informative when operating. For complicated computations, however, the spreadsheet-style interface inherits all of the worst features of spreadsheet programming. It is difficult to encapsulate common sub-calculations, and it is difficult to organize the "program" so that the computational structure is self-evident. The result is that spreadsheet-style programs are hard to write, hard to read, and error-prone.

In this respect, block-diagram interfaces do a better job, since their main focus is on the underlying computation rather than on the signals themselves. Thus, screen "real estate" is devoted to the computation rather than to the signal graphics. However, as the complexity of computations grows, the geometric and visual approach eventually becomes unwieldy. When was the last time you used a flowchart to design or document a program?

It follows that visual interfaces for specifying computations tend to be best suited for teaching and interactive exploration.
50.6.3 Parametric Control of Operations
Speech processing operations often are based on complicated algorithms with numerous parameters. Consequently, the means for specifying parameters is an important issue for speech software. The simplest form of parametric control is provided by command-line options on command-line programs. This is convenient, but can be cumbersome if there are many parameters. A common alternative is to read parameter values from parameter files that are prepared in advance. Typically, command-line values can be used to override values in the parameter file. A third input source for parameter values is directly from the user, in response to prompts issued by the program.
Some systems offer the flexibility of a hierarchy of inputs for parameter values, for example:

• default values
• values from a global parameter file read by all programs
• values from a program-specific parameter file
• values from the command line
• values from the user in response to run-time prompts

In some situations, it is helpful if a current default value is replaced by the most recent input from a given parameter source. We refer to this property as "parameter persistence".
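A hierarchy of this kind amounts to a layered merge in which each later source overrides the earlier ones. The sketch below illustrates the idea; the parameter names and layer arguments are hypothetical, not those of any particular package.

```python
# Sketch of a parameter-value hierarchy: each later layer (global file,
# program file, command line, run-time prompts) overrides the ones before it.

def resolve_params(defaults, global_file=None, program_file=None,
                   command_line=None, prompts=None):
    """Merge parameter layers; later layers win over earlier ones."""
    params = dict(defaults)
    for layer in (global_file, program_file, command_line, prompts):
        if layer:
            params.update(layer)
    return params

params = resolve_params(
    defaults={"order": 10, "frame_ms": 20},
    global_file={"frame_ms": 25},     # global file overrides the default
    command_line={"order": 12},       # command line overrides everything below
)
print(params)   # {'order': 12, 'frame_ms': 25}
```

Parameter persistence could be modeled on top of this by writing each resolved value back as the new default for the next invocation.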
50.7 Extensibility (Closed vs. Open Systems)
Speech software is "closed" if there is no provision for the user to extend it. There is a fixed set of operations available to process and display signals. What you get is all you get.

OS-based systems are always extensible to a degree because they inherit scripting capabilities from the OS, which permits the creation of new commands. They may also provide programming libraries so that the user can write and compile new programs and use them as commands.

Workspace-based systems may be extensible if they are based on an interpreter whose programming language includes the concept of an encapsulated procedure. If so, then users can write scripts that define new commands. Some systems also allow the interpreter to be extended with commands that are implemented by underlying code in C or some other compiled language.

In general, for speech software to be extensible, it must be possible to specify operations (see Section 50.6) and also to re-use the resulting specifications in other contexts. A block-diagram interface is extensible, for example, if a given diagram can be reduced to an icon that is available for use as a single block in another diagram.

For speech software with visual interfaces, extensibility considerations also include the ability to specify new GUI controls (visible menus and buttons), the ability to tie arbitrary internal and external computations to GUI controls, and the ability to define new display methods for new signal types.

In general, extended commands may behave differently from the built-in commands provided with the speech software. For example, built-in commands may share a common user interface that is difficult to implement in an independent script or program (such a common interface might provide standard parameters for debug control, standard processing of parameter files, etc.).

If user-defined scripts, programs, and GUI components are indistinguishable from built-in facilities, we say that the speech software provides seamless extensibility.
50.8 Consistency Maintenance
A speech processing chain involves signals, operations, and parameter sets. An important consideration for speech software is whether or not consistency is maintained among all of these. Thus, for example, if one input signal is replaced with another, are all intermediate and output signals recalculated automatically? Consistency maintenance is primarily an issue for speech software with visual interfaces, namely whether or not the software guarantees that all aspects of the visible displays are consistent with each other.

Spreadsheet-style interfaces (for display-oriented software) and block-diagram interfaces (for compute-oriented software) usually provide consistency maintenance.
50.9 Other Characteristics of Common Approaches
50.9.1 Memory-Based vs. File-Based
"Memory-based" speech software carries out all of its processing and display operations on signals that are stored entirely within memory, regardless of whether or not the signals also have an external representation as a disk file. This approach has obvious limitations with respect to signal size, but it simplifies programming and yields fast operation. Thus, memory-based software is well suited for teaching and the interactive exploration of small samples.

In "file-based" speech software, on the other hand, signals are represented and manipulated as disk files. The software partially buffers portions of the signal in memory as required for processing and display operations. Although programming can be more complicated, the advantage is that there are no inherent limitations on signal size. The file-based approach is, therefore, well suited for large-scale experiments.
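The partial-buffering idea can be sketched with simple seek() arithmetic: only the requested segment of a (possibly enormous) signal file is read into memory. The sample width, byte order, and headerless file layout below are assumptions made for illustration; real formats carry a header describing these.

```python
# Sketch of file-based signal access: read just a window of samples from a
# signal file, assuming headerless 16-bit little-endian samples.
import struct

SAMPLE_BYTES = 2   # assumed 16-bit linear samples

def read_segment(path, start, count):
    """Read `count` samples beginning at sample index `start`."""
    with open(path, "rb") as f:
        f.seek(start * SAMPLE_BYTES)            # skip straight to the window
        data = f.read(count * SAMPLE_BYTES)
    return list(struct.unpack(f"<{len(data) // SAMPLE_BYTES}h", data))

# write a small demonstration "signal file", then window into it
with open("demo.sd", "wb") as f:
    f.write(struct.pack("<8h", *range(8)))

print(read_segment("demo.sd", 3, 2))   # [3, 4]
```

Because only the window is materialized, the same function works unchanged whether the file holds eight samples or eight hours of speech.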
50.9.2 Documentation of Processing History
Modern speech processing involves complicated algorithms with many processing steps and operating parameters. As a result, it is often important to be able to reconstruct exactly how a given signal was produced. Speech software can help here by creating appropriate records as signal and parameter files are processed.

The most common method for recording this information about a given signal is to put it in the same file as the signal. Most modern speech software uses a file format that includes a "file header" that is used for this purpose. Most systems store at least some information in the header, e.g., the sampling rate of the signal. Others, such as ESPS, attempt to store all relevant information. In this approach, the header of a signal file produced by any program includes the program name, values of processing parameters, and the names and headers of all source files. The header is a recursive structure, so that the headers of the source files themselves contain the names and headers of files that were prior sources. Thus, a signal file header contains the headers of all source files in the processing chain. It follows that files contain a complete history of the origin of the data in the file and all the intermediate processing steps. The importance of record keeping grows with the complexity of computation chains and the extent of available parametric control.
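The recursive-header idea can be illustrated with a toy structure. The field names, programs, and file names below are invented for illustration; the actual ESPS header format differs in detail.

```python
# Toy illustration of recursive header-based processing history: each output
# file's header records the program, its parameters, and the full headers of
# all source files, so the entire chain can be recovered from the final file.

def make_header(program, params, sources):
    return {
        "program": program,
        "params": params,
        "sources": [{"name": n, "header": h} for n, h in sources],
    }

# a three-step chain: record -> filter -> spectrum (hypothetical programs)
raw = make_header("record", {"rate": 16000}, [])
filt = make_header("filter", {"cutoff": 3400}, [("speech.sd", raw)])
spec = make_header("spectrum", {"order": 512}, [("filtered.sd", filt)])

def chain(header):
    """Walk the recursive structure to list the whole processing chain."""
    progs = [header["program"]]
    for src in header["sources"]:
        progs += chain(src["header"])
    return progs

print(chain(spec))   # ['spectrum', 'filter', 'record']
```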
50.9.3 Personalization
There is considerable variation in the extent to which speech software can be customized to suit personal requirements and tastes. Some systems cannot be personalized at all; they start out the same way, every time. But most systems store personal preferences and use them again next time. Savable preferences may include color selections, button layout, button semantics, menu contents, currently loaded signals, visible windows, window arrangement, and default parameter sets for speech processing operations.

At the extreme, some systems can save a complete "snapshot" that permits exact resumption. This is particularly important for the interactive study of complicated signal configurations across repeated software sessions.
50.9.4 Real-Time Performance
Software is generally described as "real-time" if it is able to keep up with relevant, changing inputs. In the case of speech software, this usually means that the software can keep up with input speech. Even this definition is not particularly meaningful unless the input speech is itself coming from a human speaker and digitized in real time. Otherwise, the real issue is whether or not the software is fast enough to keep up with interactive use.

For example, if one is testing speech recognition software by directly speaking into the computer, real-time performance is important. It is less important, on the other hand, if the test procedure involves running batch scripts on a database of speech files.

If the speech software is designed to take input directly from devices (or pipes, in the case of Unix), then the issue becomes one of CPU speed.
50.9.5 Source Availability
It is unfortunate but true that the best documentation for a given speech processing command is often the source code. Thus, the availability of source code may be an important factor for this reason alone. Typically, this is more important when the software is used in advanced R&D applications. Sources also are needed if users have requirements to port the speech software to additional platforms. Source availability may also be important for extensibility, since it may not be possible to extend the speech software without the sources.

If the speech software is interpreter-based, sources of interest will include the sources for any built-in operations that are implemented as interpreter scripts.
50.9.6 Hardware Requirements
Speech software may require the installation of special-purpose hardware. There are two main reasons for such requirements: to accelerate particular computations (e.g., spectrograms), and to provide speech I/O with A/D and D/A converters.

Such hardware has several disadvantages. It adds to the system cost, and it decreases the overall reliability of the system. It may also constrain system software upgrades; for example, the extra hardware may use special device drivers that do not survive OS upgrades. Special-purpose hardware used to be common, but is less so now owing to the continuing increase in CPU speeds and the prevalence of built-in audio I/O. It is still important, however, when maximum speed and high-quality audio I/O are important. CSL is a good example of an integrated hardware/software approach.
50.9.7 Cross-Platform Compatibility
If your hardware platform may change or your site has a variety of platforms, then it is important to consider whether the speech software is available across a variety of platforms. Source availability (Section 50.9.5) is relevant here.

If you intend to run the speech software on several platforms that have different underlying numeric representations (a byte-order difference being most likely), then it is important to know whether the file formats and signal I/O software support transparent data exchange.
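The byte-order problem can be made concrete with a short sketch: the same 16-bit samples serialized little-endian and big-endian differ on disk, so portable signal I/O must read with an explicit, known byte order (e.g., one recorded in the file header) rather than the machine's native default.

```python
# Byte-order sketch: the same 16-bit samples differ on disk depending on
# endianness, but reading back with the matching byte order recovers them.
import struct

samples = [0, 1000, -1000]

little = struct.pack("<3h", *samples)   # little-endian 16-bit encoding
big = struct.pack(">3h", *samples)      # big-endian 16-bit encoding
assert little != big                    # the raw bytes on disk differ

# explicit byte order on read makes the data exchange transparent
assert list(struct.unpack("<3h", little)) == samples
assert list(struct.unpack(">3h", big)) == samples
print("round-trips OK")
```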
50.9.8 Degree of Specialization
Some speech software is intended for general-purpose work in speech (e.g., ESPS/waves, MATLAB™). Other software is intended for more specialized usage. Some of the areas where specialized software tools may be relevant include linguistics, recognition, synthesis, coding, psycho-acoustics, clinical voice analysis, music, multi-media, sound and vibration, etc. Two examples are HTK for recognition, and Delta (from Eloquent Technology) for synthesis.