Shore, J. "Software Tools for Speech Research and Development". Digital Signal Processing Handbook. Ed. Vijay K. Madisetti and Douglas B. Williams. Boca Raton: CRC Press LLC, 1999.
Software Tools for Speech Research and Development

John Shore
Entropic Research Laboratory, Inc.
50.1 Introduction
50.2 Historical Highlights
50.3 The User's Environment (OS-Based vs. Workspace-Based)
    Operating-System-Based Environment • Workspace-Based Environment
50.4 Compute-Oriented vs. Display-Oriented
    Compute-Oriented Software • Display-Oriented Software • Hybrid Compute/Display-Oriented Software
50.5 Compiled vs. Interpreted
    Interpreted Software • Compiled Software • Hybrid Interpreted/Compiled Software • Computation vs. Display
50.6 Specifying Operations Among Signals
    Text-Based Interfaces • Visual ("Point-and-Click") Interfaces • Parametric Control of Operations
50.7 Extensibility (Closed vs. Open Systems)
50.8 Consistency Maintenance
50.9 Other Characteristics of Common Approaches
    Memory-Based vs. File-Based • Documentation of Processing History • Personalization • Real-Time Performance • Source Availability • Hardware Requirements • Cross-Platform Compatibility • Degree of Specialization • Support for Speech Input and Output
50.10 File Formats (Data Import/Export)
50.11 Speech Databases
50.12 Summary of Characteristics and Uses
50.13 Sources for Finding Out What is Currently Available
50.14 Future Trends
References
50.1 Introduction
Experts in every field of study depend on specialized tools. In the case of speech research and development, the dominant tools today are computer programs. In this article, we present an overview of key technical approaches and features that are prevalent today.
We restrict the discussion to software intended to support R&D, as opposed to software for commercial applications of speech processing. For example, we ignore DSP programming (which is discussed in the previous article). Also, we concentrate on software intended to support the specialties of speech analysis, coding, synthesis, and recognition, since these are the main subjects of this chapter. However, much of what we have to say applies as well to the needs of those in such closely related areas as psycho-acoustics, clinical voice analysis, sound and vibration, etc.
We do not attempt to survey available software packages, as the result would likely be obsolete by the time this book is printed. The examples mentioned are illustrative, and not intended to provide a thorough or balanced review. Our aim is to provide sufficient background so that readers can assess their needs and understand the differences among available tools. Up-to-date surveys are readily available online (see Section 50.13).
In general, there are three common uses of speech R&D software:

• Teaching, e.g., homework assignments for a basic course in speech processing
• Interactive, free-form exploration, e.g., designing a filter and evaluating its effects on a speech processing system
• Batch experiments, e.g., training and testing speech coders or speech recognizers using a large database
The relative importance of various features differs among these uses. For example, in conducting batch experiments, it is important that large signals can be handled, and that complicated algorithms execute efficiently. For teaching, on the other hand, these features are less important than simplicity, quick experimentation, and ease of use. Because of practical limitations, such differences in priority mean that no one software package today can meet all needs.
To explain the variation among current approaches, we identify a number of distinguishing characteristics. These characteristics are not independent (i.e., there is considerable overlap), but they do help to present the overall view.

For simplicity, we will refer to any particular speech R&D software as "the speech software".
50.2 Historical Highlights
Early or significant examples of speech R&D software include "Visible Speech" [5], MITSYN [1], and Lloyd Rice's WAVE program of the mid-1970s (not to be confused with David Talkin's waves [8]). The first general, commercial system that achieved widespread acceptance was the Interactive Laboratory System (ILS) from Signal Technology Incorporated, which was popular in the late 1970s and early 1980s. Using the terminology defined below, ILS is compute-oriented software with an operating-system-based environment. The first popular, display-oriented, workspace-based speech software was David Shipman's LISP-machine application called Spire [6].
50.3 The User's Environment (OS-Based vs. Workspace-Based)
In some cases, the user sees the speech software as an extension of the computer's operating system. We call this "operating-system-based" (or OS-based); an example is the Entropic Signal Processing System (ESPS) [7].

In other cases, the software provides its own operating environment. We call this "workspace-based" (from the term used in implementations of the programming language APL); an example is MATLAB™ (from The MathWorks).
50.3.1 Operating-System-Based Environment
In this approach, signals are represented as files under the native operating system (e.g., Unix, DOS), and the software consists of a set of programs that can be invoked separately to process or display signals in various ways. Thus, the user sees the software as an extension of an already-familiar operating system. Because signals are represented as files, the speech software inherits file manipulation capabilities from the operating system. Under Unix, for example, signals can be copied and moved using the cp and mv programs, respectively, and they can be organized as directory trees in the Unix hierarchical file system (including NFS).
Similarly, the speech software inherits extension capabilities inherent in the operating system. Under Unix, for example, extensions can be created using shell scripts in various languages (sh, csh, Tcl, perl, etc.), as well as such facilities as pipes and remote execution. OS-based speech software packages are often called command-line packages because usage typically involves providing a sequence of commands to some type of shell.
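As a concrete sketch of this command-line style, the snippet below composes speech tools into a Unix-style pipeline. The program names and flags (filter, spectrum, and their options) are invented for illustration; they are not the actual commands of ESPS or any other package.

```python
# Hedged sketch: composing hypothetical OS-based, command-line speech tools
# into a processing chain, in the style of a Unix shell pipeline.

def pipeline(*stages):
    """Join command-line stages with Unix pipes, as a shell would."""
    return " | ".join(stages)

cmd = pipeline(
    "filter -type lowpass -cutoff 3400 speech.sd",  # hypothetical filter tool
    "spectrum -order 512 -",                        # hypothetical analysis tool
)
print(cmd)
```

A string like this would then be handed to a shell (or each stage spawned with a pipe between them); the point is that the chain is expressed with ordinary OS facilities rather than inside a closed application.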
50.3.2 Workspace-Based Environment
In this approach, the user interacts with a single application program that takes over from the operating system. Signals, which may or may not correspond to files, are typically represented as variables in some kind of virtual space. Various commands are available to process or display the signals. Such a workspace is often analogous to a personal blackboard.

Workspace-based systems usually offer means for saving the current workspace contents and for loading previously saved workspaces.

An extension mechanism is typically provided by a command interpreter for a simple language that includes the available operations and a means for encapsulating and invoking command sequences (e.g., in a function or procedure definition). In effect, the speech software provides its own shell to the user.
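The workspace idea can be sketched as follows. This is an illustrative toy, assuming signals are simple lists of samples; it is not the mechanism of MATLAB™ or any particular package.

```python
# Minimal sketch of a workspace-based environment: signals live as named
# variables in a workspace, which can be saved and reloaded across sessions.
import pickle

class Workspace:
    def __init__(self):
        self.signals = {}          # name -> list of samples

    def let(self, name, samples):  # bind a signal variable in the workspace
        self.signals[name] = list(samples)

    def save(self, path):          # persist the whole workspace to disk
        with open(path, "wb") as f:
            pickle.dump(self.signals, f)

    def load(self, path):          # resume a previously saved session
        with open(path, "rb") as f:
            self.signals = pickle.load(f)

ws = Workspace()
ws.let("x", [0.0, 0.5, 1.0])
ws.save("session.pkl")

ws2 = Workspace()                  # a later session
ws2.load("session.pkl")
print(ws2.signals["x"])            # the saved signal variable is back
```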
50.4 Compute-Oriented vs. Display-Oriented
This distinction concerns whether the speech software emphasizes computation, visualization, or both.
50.4.1 Compute-Oriented Software
If there is a large number of signal processing operations relative to the number of signal display operations, we say that the software is compute-oriented. Such software typically can be operated without a display device, and the user thinks of it primarily as a computation package that supports such functions as spectral analysis, filtering, linear prediction, quantization, analysis/synthesis, pattern classification, Hidden Markov Model (HMM) training, speech recognition, etc.
Compute-oriented software can be either OS-based or workspace-based. Examples include ESPS, MATLAB™, and the Hidden Markov Model Toolkit (HTK) (from Cambridge University and Entropic).
50.4.2 Display-Oriented Software
In contrast, display-oriented speech software is not intended to, and often cannot, operate without a display device. The primary purpose is to support visual inspection of waveforms, spectrograms, and other parametric representations. The user typically interacts with the software using a mouse or other pointing device to initiate display operations such as scrolling, zooming, enlarging, etc.

While the software may also provide computations that can be performed on displayed signals (or marked segments of displayed signals), the user thinks of the software as supporting visualization more than computation. An example is the waves program [8].
50.4.3 Hybrid Compute/Display-Oriented Software
Hybrid compute/display software combines the best of both. Interactions are typically by means of a display device, but computational capabilities are rich. The computational capabilities may be built in to workspace-based speech software, or may be OS-based but accessible from the display program. Examples include the Computerized Speech Lab (CSL) from Kay Elemetrics Corp., and the combination of ESPS and waves.
50.5 Compiled vs. Interpreted
Here we distinguish according to whether the bulk of the signal processing or display code (whether written by developers or users) is interpreted or compiled.
50.5.1 Interpreted Software
The interpreter language may be specially designed for the software (e.g., S-PLUS from Statistical Sciences, Inc., and MATLAB™), or may be an existing, general-purpose language (e.g., LISP is used in N!Power from Signal Technology, Inc.).

Compared to compiler languages, interpreter languages tend to be simpler and easier to learn. Furthermore, it is usually easier and faster to write and test programs under an interpreter. The disadvantage, relative to compiled languages, is that the resulting programs can be quite slow to run. As a result, interpreted speech software is usually better suited for teaching and interactive exploration than for batch experiments.
50.5.2 Compiled Software
Compared to interpreted languages, compiled languages (e.g., FORTRAN, C, C++) tend to be more complicated and harder to learn. Compared to interpreted programs, compiled programs are slower to write and test, but considerably faster to run. As a result, compiled speech software is usually better suited for batch experiments than for teaching.
50.5.3 Hybrid Interpreted/Compiled Software
Some interpreters make it possible to create new language commands with an underlying implementation that is compiled. This allows a hybrid approach that can combine the best of both.

Some languages provide a hybrid approach in which the source code is pre-compiled quickly into intermediate code that is then (usually!) interpreted. Java is a good example.
If compiled speech software is OS-based, signal processing scripts can typically be written in an interpretive language (e.g., a sh script containing a sequence of calls to ESPS programs). Thus, hybrid systems can also be based on compiled software.
50.5.4 Computation vs. Display
The distinction between compiled and interpreted languages is relevant mostly to the computational aspects of the speech software. However, the distinction can apply as well to display software, since some display programs are compiled (e.g., using Motif) while others exploit interpreters (e.g., Tcl/Tk, Java).
50.6 Specifying Operations Among Signals
Here we are concerned with the means by which users specify what operations are to be done and on what signals. This consideration is relevant to how speech software can be extended with user-defined operations (see Section 50.7), but is an issue even in software that is not extensible.

The main distinction is between a text-based interface and a visual ("point-and-click") interface. Visual interfaces tend to be less general but easier to use.
50.6.1 Text-Based Interfaces
Traditional interfaces for specifying computations are based on a textual representation in the form of scripts and programs. For OS-based speech software, operations are typically specified by typing the name of a command (with possible options) directly to a shell. One can also enter a sequence of such commands into a text editor when preparing a script.

This style of specifying operations also is available for workspace-based speech software that is based on a command interpreter. In this case, the text comprises legal commands and programs in the interpreter language.

Both OS-based and workspace-based speech software may also permit the specification of operations using source code in a high-level language (e.g., C) that gets compiled.
50.6.2 Visual (“Point-and-Click”) Interfaces
The point-and-click approach has become the ubiquitous user interface of the 1990s. Operations and operands (signals) are specified by using a mouse or other pointing device to interact with on-screen graphical user interface (GUI) controls such as buttons and menus. The interface may also have a text-based component to allow the direct entry of parameter values or formulas relating signals.
Visual Interfaces for Display-Oriented Software
In display-oriented software, the signals on which operations are to be performed are visible as waveforms or other directly representative graphics.

A typical user interaction proceeds as follows: A relevant signal is specified by a mouse-click operation (if a signal segment is involved, it is selected by a click-and-drag operation or by a pair of mouse-click operations). The operation to be performed is then specified by mouse-click operations on screen buttons, pull-down menus, or pop-up menus.
This style works very well for unary operations (e.g., compute and display the spectrogram of a given signal segment), and moderately well for binary operations (e.g., add two signals). But it is awkward for operations that have more than two inputs. It is also awkward for specifying chained calculations, especially if you want to repeat the calculations for a new set of signals.

One solution to these problems is provided by a "calculator-style" interface that looks and acts like a familiar arithmetic calculator (except the operands are signal names and the operations are signal processing operations).
Another solution is the "spreadsheet-style" interface. The analogy with spreadsheets is tight. Imagine a spreadsheet in which the cells are replaced by images (waveforms, spectrograms, etc.) connected logically by formulas. For example, one cell might show a test signal, a second might show the results of filtering it, and a third might show a spectrogram of a portion of the filtered signal. This exemplifies a spreadsheet-style interface for speech software.
A spreadsheet-style interface provides some means for specifying the "formulas" that relate the various "cells". This formula interface might itself be implemented in a point-and-click fashion, or it might permit direct entry of formulas in some interpretive language. Speech software with a spreadsheet-style interface will maintain consistency among the visible signals. Thus, if one of the signals is edited or replaced, the other signal graphics change correspondingly, according to the underlying formulas.

DADiSP (from DSP Development Corporation) is an example of a spreadsheet-style interface.
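The formula-driven consistency of a spreadsheet-style interface can be sketched in a few lines. This toy (its classes and cell names are invented for illustration) treats each "cell" as either a raw signal or a formula over other cells, so dependent cells always reflect the current sources.

```python
# Toy sketch of spreadsheet-style consistency maintenance: cells hold either
# raw signals or formulas over other cells, and values are recomputed
# through the formulas so that dependents track every source change.

class Sheet:
    def __init__(self):
        self.raw = {}        # cell name -> raw signal (list of samples)
        self.formulas = {}   # cell name -> (function, source cell names)

    def set_signal(self, name, samples):
        self.raw[name] = list(samples)

    def set_formula(self, name, func, *sources):
        self.formulas[name] = (func, sources)

    def value(self, name):
        # recompute through formulas so cells stay consistent with sources
        if name in self.raw:
            return self.raw[name]
        func, sources = self.formulas[name]
        return func(*(self.value(s) for s in sources))

sheet = Sheet()
sheet.set_signal("test", [1.0, -2.0, 3.0])
sheet.set_formula("rectified", lambda x: [abs(v) for v in x], "test")

print(sheet.value("rectified"))   # [1.0, 2.0, 3.0]
sheet.set_signal("test", [-5.0, 5.0])
print(sheet.value("rectified"))   # [5.0, 5.0] -- tracks the edited source
```

Replacing the "test" cell automatically changes what "rectified" shows, which is exactly the consistency behavior described above.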
Visual Interfaces for Compute-Oriented Software
In a visual interface for display-oriented software, the focus is on the signals themselves. In a visual interface for compute-oriented software, on the other hand, the focus is on the operations. Operations among signals typically are represented as icons with one or more input and output lines that interconnect the operations. In effect, the representation of a signal is reduced to a straight line indicating its relationship (input or output) with respect to operations. Such visual interfaces are often called block-diagram interfaces. A block-diagram interface thus provides a visual representation of the computation chain. Various point-and-click means are provided to support the user in creating, examining, and modifying block diagrams.

Ptolemy [4] and N!Power are examples of systems that provide a block-diagram interface.
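The dataflow idea underlying a block-diagram interface can be sketched as follows. The block names and the tiny chain are invented for illustration; real systems such as Ptolemy are far more elaborate.

```python
# Illustrative sketch of a block-diagram computation: operations are nodes,
# and signals are just the lines connecting one block's output to another's
# input.  Running the final block pulls data through the whole chain.

class Block:
    def __init__(self, name, func, *inputs):
        self.name, self.func, self.inputs = name, func, inputs

    def run(self):
        # evaluate upstream blocks first, then apply this block's operation
        args = [b.run() for b in self.inputs]
        return self.func(*args)

# a tiny chain: source -> gain -> offset
src = Block("source", lambda: [1.0, 2.0, 3.0])
gain = Block("gain2", lambda x: [2 * v for v in x], src)
offset = Block("plus1", lambda x: [v + 1 for v in x], gain)

print(offset.run())   # [3.0, 5.0, 7.0]
```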
Limitations of Visual Interfaces
Although much in vogue, visual interfaces are inherently limited as a means for specifying signal computations.
For example, the analogy between spreadsheets and spreadsheet-style speech software continues. For simple signal computations, the spreadsheet-style interface can be very useful; computations are simple to set up and informative when operating. For complicated computations, however, the spreadsheet-style interface inherits all of the worst features of spreadsheet programming. It is difficult to encapsulate common sub-calculations, and it is difficult to organize the "program" so that the computational structure is self-evident. The result is that spreadsheet-style programs are hard to write, hard to read, and error-prone.

In this respect, block-diagram interfaces do a better job, since their main focus is on the underlying computation rather than on the signals themselves. Thus, screen "real estate" is devoted to the computation rather than to the signal graphics. However, as the complexity of computations grows, the geometric and visual approach eventually becomes unwieldy. When was the last time you used a flowchart to design or document a program?

It follows that visual interfaces for specifying computations tend to be best suited for teaching and interactive exploration.
50.6.3 Parametric Control of Operations
Speech processing operations often are based on complicated algorithms with numerous parameters. Consequently, the means for specifying parameters is an important issue for speech software. The simplest form of parametric control is provided by command-line options on command-line programs. This is convenient, but can be cumbersome if there are many parameters. A common alternative is to read parameter values from parameter files that are prepared in advance. Typically, command-line values can be used to override values in the parameter file. A third input source for parameter values is directly from the user, in response to prompts issued by the program.
Some systems offer the flexibility of a hierarchy of inputs for parameter values, for example:

• default values
• values from a global parameter file read by all programs
• values from a program-specific parameter file
• values from the command line
• values from the user in response to run-time prompts

In some situations, it is helpful if a current default value is replaced by the most recent input from a given parameter source. We refer to this property as "parameter persistence".
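A hierarchy of this kind amounts to a layered merge in which each later source overrides the earlier ones. The sketch below illustrates the idea; the parameter names and layer arguments are hypothetical, not those of any particular package.

```python
# Sketch of a parameter-value hierarchy: each later layer (global file,
# program file, command line, run-time prompts) overrides the ones before it.

def resolve_params(defaults, global_file=None, program_file=None,
                   command_line=None, prompts=None):
    """Merge parameter layers; later layers win over earlier ones."""
    params = dict(defaults)
    for layer in (global_file, program_file, command_line, prompts):
        if layer:
            params.update(layer)
    return params

params = resolve_params(
    defaults={"order": 10, "frame_ms": 20},
    global_file={"frame_ms": 25},     # global file overrides the default
    command_line={"order": 12},       # command line overrides everything below
)
print(params)   # {'order': 12, 'frame_ms': 25}
```

Parameter persistence could be modeled on top of this by writing each resolved value back as the new default for the next invocation.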
50.7 Extensibility (Closed vs. Open Systems)
Speech software is "closed" if there is no provision for the user to extend it. There is a fixed set of operations available to process and display signals. What you get is all you get.

OS-based systems are always extensible to a degree because they inherit scripting capabilities from the OS, which permits the creation of new commands. They may also provide programming libraries so that the user can write and compile new programs and use them as commands.

Workspace-based systems may be extensible if they are based on an interpreter whose programming language includes the concept of an encapsulated procedure. If so, then users can write scripts that define new commands. Some systems also allow the interpreter to be extended with commands that are implemented by underlying code in C or some other compiled language.

In general, for speech software to be extensible, it must be possible to specify operations (see Section 50.6) and also to re-use the resulting specifications in other contexts. A block-diagram interface is extensible, for example, if a given diagram can be reduced to an icon that is available for use as a single block in another diagram.

For speech software with visual interfaces, extensibility considerations also include the ability to specify new GUI controls (visible menus and buttons), the ability to tie arbitrary internal and external computations to GUI controls, and the ability to define new display methods for new signal types.

In general, extended commands may behave differently from the built-in commands provided with the speech software. For example, built-in commands may share a common user interface that is difficult to implement in an independent script or program (such a common interface might provide standard parameters for debug control, standard processing of parameter files, etc.).

If user-defined scripts, programs, and GUI components are indistinguishable from built-in facilities, we say that the speech software provides seamless extensibility.
50.8 Consistency Maintenance
A speech processing chain involves signals, operations, and parameter sets. An important consideration for speech software is whether or not consistency is maintained among all of these. Thus, for example, if one input signal is replaced with another, are all intermediate and output signals recalculated automatically? Consistency maintenance is primarily an issue for speech software with visual interfaces, namely whether or not the software guarantees that all aspects of the visible displays are consistent with each other.

Spreadsheet-style interfaces (for display-oriented software) and block-diagram interfaces (for compute-oriented software) usually provide consistency maintenance.
50.9 Other Characteristics of Common Approaches
50.9.1 Memory-Based vs. File-Based
"Memory-based" speech software carries out all of its processing and display operations on signals that are stored entirely within memory, regardless of whether or not the signals also have an external representation as a disk file. This approach has obvious limitations with respect to signal size, but it simplifies programming and yields fast operation. Thus, memory-based software is well suited for teaching and the interactive exploration of small samples.

In "file-based" speech software, on the other hand, signals are represented and manipulated as disk files. The software partially buffers portions of the signal in memory as required for processing and display operations. Although programming can be more complicated, the advantage is that there are no inherent limitations on signal size. The file-based approach is, therefore, well suited for large-scale experiments.
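The partial-buffering idea can be sketched with simple seek() arithmetic: only the requested segment of a (possibly enormous) signal file is read into memory. The sample width, byte order, and headerless file layout below are assumptions made for illustration; real formats carry a header describing these.

```python
# Sketch of file-based signal access: read just a window of samples from a
# signal file, assuming headerless 16-bit little-endian samples.
import struct

SAMPLE_BYTES = 2   # assumed 16-bit linear samples

def read_segment(path, start, count):
    """Read `count` samples beginning at sample index `start`."""
    with open(path, "rb") as f:
        f.seek(start * SAMPLE_BYTES)            # skip straight to the window
        data = f.read(count * SAMPLE_BYTES)
    return list(struct.unpack(f"<{len(data) // SAMPLE_BYTES}h", data))

# write a small demonstration "signal file", then window into it
with open("demo.sd", "wb") as f:
    f.write(struct.pack("<8h", *range(8)))

print(read_segment("demo.sd", 3, 2))   # [3, 4]
```

Because only the window is materialized, the same function works unchanged whether the file holds eight samples or eight hours of speech.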
50.9.2 Documentation of Processing History
Modern speech processing involves complicated algorithms with many processing steps and operating parameters. As a result, it is often important to be able to reconstruct exactly how a given signal was produced. Speech software can help here by creating appropriate records as signal and parameter files are processed.

The most common method for recording this information about a given signal is to put it in the same file as the signal. Most modern speech software uses a file format that includes a "file header" that is used for this purpose. Most systems store at least some information in the header, e.g., the sampling rate of the signal. Others, such as ESPS, attempt to store all relevant information. In this approach, the header of a signal file produced by any program includes the program name, values of processing parameters, and the names and headers of all source files. The header is a recursive structure, so that the headers of the source files themselves contain the names and headers of files that were prior sources. Thus, a signal file header contains the headers of all source files in the processing chain. It follows that files contain a complete history of the origin of the data in the file and all the intermediate processing steps. The importance of record keeping grows with the complexity of computation chains and the extent of available parametric control.
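The recursive-header idea can be illustrated with a toy structure. The field names, programs, and file names below are invented for illustration; the actual ESPS header format differs in detail.

```python
# Toy illustration of recursive header-based processing history: each output
# file's header records the program, its parameters, and the full headers of
# all source files, so the entire chain can be recovered from the final file.

def make_header(program, params, sources):
    return {
        "program": program,
        "params": params,
        "sources": [{"name": n, "header": h} for n, h in sources],
    }

# a three-step chain: record -> filter -> spectrum (hypothetical programs)
raw = make_header("record", {"rate": 16000}, [])
filt = make_header("filter", {"cutoff": 3400}, [("speech.sd", raw)])
spec = make_header("spectrum", {"order": 512}, [("filtered.sd", filt)])

def chain(header):
    """Walk the recursive structure to list the whole processing chain."""
    progs = [header["program"]]
    for src in header["sources"]:
        progs += chain(src["header"])
    return progs

print(chain(spec))   # ['spectrum', 'filter', 'record']
```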
50.9.3 Personalization
There is considerable variation in the extent to which speech software can be customized to suit personal requirements and tastes. Some systems cannot be personalized at all; they start out the same way, every time. But most systems store personal preferences and use them again next time. Savable preferences may include color selections, button layout, button semantics, menu contents, currently loaded signals, visible windows, window arrangement, and default parameter sets for speech processing operations.

At the extreme, some systems can save a complete "snapshot" that permits exact resumption. This is particularly important for the interactive study of complicated signal configurations across repeated software sessions.
50.9.4 Real-Time Performance
Software is generally described as "real-time" if it is able to keep up with relevant, changing inputs. In the case of speech software, this usually means that the software can keep up with input speech. Even this definition is not particularly meaningful unless the input speech is itself coming from a human speaker and digitized in real time. Otherwise, the real issue is whether or not the software is fast enough to keep up with interactive use.

For example, if one is testing speech recognition software by directly speaking into the computer, real-time performance is important. It is less important, on the other hand, if the test procedure involves running batch scripts on a database of speech files.

If the speech software is designed to take input directly from devices (or pipes, in the case of Unix), then the issue becomes one of CPU speed.
50.9.5 Source Availability
It is unfortunate but true that the best documentation for a given speech processing command is often the source code. Thus, the availability of source code may be an important factor for this reason alone. Typically, this is more important when the software is used in advanced R&D applications. Sources also are needed if users have requirements to port the speech software to additional platforms. Source availability may also be important for extensibility, since it may not be possible to extend the speech software without the sources.

If the speech software is interpreter-based, sources of interest will include the sources for any built-in operations that are implemented as interpreter scripts.
50.9.6 Hardware Requirements
Speech software may require the installation of special-purpose hardware. There are two main reasons for such requirements: to accelerate particular computations (e.g., spectrograms), and to provide speech I/O with A/D and D/A converters.

Such hardware has several disadvantages. It adds to the system cost, and it decreases the overall reliability of the system. It may also constrain system software upgrades; for example, the extra hardware may use special device drivers that do not survive OS upgrades. Special-purpose hardware used to be common, but is less so now owing to the continuing increase in CPU speeds and the prevalence of built-in audio I/O. It is still important, however, when maximum speed and high-quality audio I/O are important. CSL is a good example of an integrated hardware/software approach.
50.9.7 Cross-Platform Compatibility
If your hardware platform may change or your site has a variety of platforms, then it is important to consider whether the speech software is available across a variety of platforms. Source availability (Section 50.9.5) is relevant here.

If you intend to run the speech software on several platforms that have different underlying numeric representations (a byte-order difference being most likely), then it is important to know whether the file formats and signal I/O software support transparent data exchange.
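The byte-order problem can be made concrete with a short sketch: the same 16-bit samples serialized little-endian and big-endian differ on disk, so portable signal I/O must read with an explicit, known byte order (e.g., one recorded in the file header) rather than the machine's native default.

```python
# Byte-order sketch: the same 16-bit samples differ on disk depending on
# endianness, but reading back with the matching byte order recovers them.
import struct

samples = [0, 1000, -1000]

little = struct.pack("<3h", *samples)   # little-endian 16-bit encoding
big = struct.pack(">3h", *samples)      # big-endian 16-bit encoding
assert little != big                    # the raw bytes on disk differ

# explicit byte order on read makes the data exchange transparent
assert list(struct.unpack("<3h", little)) == samples
assert list(struct.unpack(">3h", big)) == samples
print("round-trips OK")
```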
50.9.8 Degree of Specialization
Some speech software is intended for general-purpose work in speech (e.g., ESPS/waves, MATLAB™). Other software is intended for more specialized usage. Some of the areas where specialized software tools may be relevant include linguistics, recognition, synthesis, coding, psycho-acoustics, clinical voice analysis, music, multi-media, sound and vibration, etc. Two examples are HTK for recognition, and Delta (from Eloquent Technology) for synthesis.