HANOI UNIVERSITY OF TECHNOLOGY

THESIS FOR THE DEGREE OF MASTER OF SCIENCE
IN INFORMATION PROCESSING AND COMMUNICATION

STUDY AND DESIGN A PROCEDURE FOR BUILDING
SPEECH CORPORA FOR MINORITY LANGUAGES IN VIETNAM

ĐOÀN THỊ NGỌC HIEN

Supervisor: Dr. Eric Castelli

HANOI, 2005
Acknowledgments
During the course of my thesis work, there were many people who were instrumental in helping me. I would like to take this opportunity to acknowledge some of them.

Firstly, I would like to express my gratitude to my supervisor, Dr. Eric Castelli, whose expertise, understanding, patience, and constructively critical eye added considerably to my graduate experience.

Special thanks go out to Dr. Nguyen Trong Giang and Dr. Pham Thi Ngoc Yen for providing me with the best working conditions during my time at the International Research Center MICA.

I would like to thank Mr. Tran Do Dat, who has a great deal of experience in building speech corpus databases, for his helpful advice throughout the research and the recording of the speech corpus.

I would also like to thank my family, especially my parents, for the support they provided me through my entire life; without their care and encouragement I would not have finished this thesis.

Finally, thanks go to all of my colleagues who helped me while I worked on this thesis.
Contents (excerpt)

2.2 Speech signal representations
2.2.3 Linear predictive coding
3.2 The program of management of the VNSpeechCorpus
3.2.1 Study of the SAM standard
3.2.3 Conversion of SAM signal into WAV signal
3.3.1 Large Vocabulary Continuous Speech Recognition system for Vietnamese
3.3.2 Vietnamese Speech Synthesis
CHAPTER 4. THE VIETNAMESE MINORITY LANGUAGES
CHAPTER 5. THE SPEECH CORPUS AND THE ADAPTIVE TECHNIQUES FOR RECORDING THE MINORITY CORPUS
List of Figures
Figure 2.1 Schematic diagram of the human vocal mechanism
Figure 2.2 Block diagram of human speech production
Figure 2.3 Basic source-filter model for speech signals
Figure 2.4 (a) Waveform with (b) its corresponding wideband spectrogram. Darker areas mean higher energy at that time and frequency
Figure 2.5 Conversion between log-energy values (on the x-axis) and gray scale (on the y-axis)
Figure 2.7 Approximation of a tube with continuously varying area A(x) as a concatenation of lossless acoustic tubes
Figure 2.8 Junction between two lossless tubes
Figure 2.9 Coupling of the nasal cavity with the oral cavity
Figure 2.10 Model of the glottal excitation for voiced sounds
Figure 2.11 General discrete-time model of speech production
Figure 2.12 Source-filter model for voiced and unvoiced speech
Figure 2.13 A mixed excitation source-filter model of speech
Figure 2.14 The orthogonality principle. The prediction error is orthogonal to the past samples
Figure 2.17 Triangular filters used in the computation of the mel-cepstrum
Figure 3.1 The structure of the VNSpeechCorpus
Figure 3.2 Description of the nomenclature of the files in the SAM standard
Figure 3.3 Example of a file name of a description of corpus
Figure 3.4 The process of building the speech database
Figure 3.5 The relation between tables of the speech database
Figure 3.6 The interface of the VNSpeechCorpus
Figure 3.7 The result of search by word and type of corpus
Figure 3.8 Table of Linear Predictive Coding coefficients
Figure 3.9 Table of Mel-Frequency Coefficients
Figure 4.1 Austro-Asiatic family graph
Figure 4.2 Austronesian family graph
Figure 4.3 Tai-Kadai family graph
Figure 4.4 Miao-Yao family graph
Figure 4.5 Sino-Tibetan family graph
Figure 5.1 Portable Minidisc Recorder SONY Walkman MZ-N707
Figure 5.2 Sound Blaster Audigy 2 ZS Notebook
Figure 5.3 USBPre Microphone Interface for Computer Audio Recording
Figure 5.4 The waveform and spectrogram of a recorded sentence
Abstract
In recent years, thanks to research in the speech processing field, scientists have obtained many considerable results, especially in speech recognition and synthesis, and these are applied in many different areas of life, such as speech-based accessibility systems. For example, there has been much work on speech-based and auditory interfaces to allow visually impaired users to access existing graphical interfaces. In general, multiple modalities have been used to make human-computer interaction accessible for people with disabilities. Since the 1990s, in order to study all aspects of speech, many speech databases have been built around the world, such as SpeechDat, SALA I-II, and SPEECON. In Vietnam, speech processing has been researched in recent years, and the International Research Center MICA (Multimedia Information, Communication and Applications) has built a large Vietnamese Speech Database (called the VNSpeechCorpus) comprising about 100 recorded hours with at least 50 speakers in different recording environments. However, while many majority-language corpora have been created and made available recently, less progress has been made in the creation of minority-language resources. Recognizing this problem, and building on the experience gained with the VNSpeechCorpus, we aim to design a procedure for building speech corpora for the languages of the minorities in Vietnam. The speech material can be chosen to characterize the vowels and consonants, and the measured parameters will be the spectral characteristics of words (formants for the vowels, fundamental frequency for the tones, etc.).
Chapter 1 INTRODUCTION
So far, many projects building speech corpora for majority languages have been carried out, such as ATR-JSDB, SpeechDat, and SALA II. However, the procedure for building a speech corpus for a majority language cannot be applied directly to minority languages, because of the different characteristics of majority and minority languages and the residential areas of the minorities. Therefore, the objective of this thesis is to study and design a procedure for building speech corpora for minority languages in Vietnam.

The thesis builds upon the three following bases. Firstly, the Vietnamese speech database that has been built at the International Research Center MICA, Hanoi University of Technology, Vietnam. Secondly, research on the history and characteristics of the minority languages in Vietnam. And the final basis is the speech corpora for some minority languages that have been built around the world.

To reach the objective of the thesis, we have to deal with four big problems. The first problem is to study the procedure used to build the Vietnamese speech database (called the VNSpeechCorpus). The second problem is to design a program for the management of the VNSpeechCorpus, because the first phase of building the VNSpeechCorpus stopped at the recording stage. The next is to design a procedure for building speech corpora for minority languages. And last but not least is to experiment with the new procedure on a minority language in Vietnam.
This thesis is organized as follows. Chapter 2 gives an overview of speech signals and their representations. Chapter 3 discusses the building of a Vietnamese speech corpus. The languages of minorities in Vietnam are studied in Chapter 4. Chapter 5 introduces the adaptive techniques for recording the corpus of minority languages. And Chapter 6 gives the conclusions and presents a summary of the work done and future work.
Trang 12Thesis for the degree of master of Information Processing and Communication
Chapter 2 SPEECH SIGNAL AND REPRESENTATIONS
2.1 Speech signal
2.1.1 Introduction
In considering the process of speech communication, it is helpful to begin by thinking of a message represented in some abstract form in the brain of the speaker. Through the complex process of producing speech, the information in that message is ultimately converted to an acoustic signal. The message information can be thought of as being represented in a number of different ways in the process of speech production. For example, the message information is first converted into a set of neural signals, which control the articulatory mechanism (that is, the motions of the tongue, lips, vocal cords, etc.). The articulators move in response to these neural signals to perform a sequence of gestures, the end result of which is an acoustic waveform that contains the information in the original message.

The information that is communicated through speech is intrinsically of a discrete nature: it can be represented by a concatenation of elements from a finite set of symbols. The symbols from which every sound can be classified are called phonemes. Each language has its own distinctive set of phonemes. For example, Vietnamese can be represented by a set of 46 phonemes.
After the speech signal is generated and propagated to the listener, the speech perception process begins. Firstly, the listener processes the acoustic signal along the basilar membrane in the inner ear, which provides a running spectral analysis of the incoming signal. A neural transduction process converts the spectral signal at the output of the basilar membrane into activity signals on the auditory nerve, corresponding roughly to a feature extraction process. In a manner that is not well understood, the neural activity along the auditory nerve is converted into a language code at the higher centers of processing within the brain, and finally message comprehension is achieved.
2.1.2 Speech production process
Figure 2.1 highlights the important features of the human vocal system. The main components of the system are the lungs, trachea (windpipe), larynx (the organ of voice production), pharyngeal cavity (throat), oral or buccal cavity (mouth), and nasal cavity (nose).
Figure 2.1 Schematic diagram of the human vocal mechanism
Trang 14Thesis for the degree of master of Information Processing and Communication
The vocal tract begins at the opening between the vocal cords, or glottis, and ends at the lips; it consists of the pharynx and the mouth or oral cavity. In the average male, the total length of the vocal tract is about 17 cm. The cross-sectional area of the vocal tract, determined by the positions of the tongue, lips, jaw and velum, varies from zero to about 20 cm². The nasal tract begins at the velum and ends at the nostrils. When the velum is lowered, the nasal tract is acoustically coupled to the vocal tract to produce the nasal sounds of speech [1].
In studying the speech production process, it is helpful to abstract the important features of the physical system in a manner that leads to a realistic yet tractable mathematical model. A block diagram of human speech production is described in Figure 2.2. In this diagram, the lungs and trachea are considered a source of energy for the production of speech. Speech is simply the acoustic wave that is radiated from this source when air is expelled from the lungs and the resulting flow of air is perturbed by a constriction somewhere in the vocal tract.
Figure 2.2 Block diagram of human speech production
Speech sounds can be classified into three distinct classes according to their mode of excitation. Voiced sounds are produced by forcing air through the glottis with the tension of the vocal cords adjusted so that they vibrate in a relaxation oscillation, thereby producing quasi-periodic pulses of air which excite the vocal tract. Fricative or unvoiced sounds are generated by forming a constriction at some point in the vocal tract and forcing air through the constriction at a high enough velocity to produce turbulence. This creates a broad-spectrum noise source to excite the vocal tract. Plosive sounds result from making a complete closure, building up pressure behind the closure, and abruptly releasing it.
2.2 Speech signal representations
This section presents several representations for speech signals that are useful in speech coding, synthesis and recognition. The central theme is the decomposition of the speech signal as a source passed through a linear time-varying filter. This filter can be derived from models of speech production based on the theory of acoustics, where the source represents the airflow at the vocal cords and the filter represents the resonances of the vocal tract, which change over time. Such a source-filter model is illustrated in Figure 2.3. We describe methods to compute both the source or excitation e[n] and the filter h[n] from the speech signal x[n].

The filter can be estimated from speech production models as well as from speech perception models (such as the Mel-frequency cepstrum). Once the filter has been estimated, the source can be obtained by passing the speech signal through the inverse filter. Separation between source and filter is one of the most difficult challenges in speech processing. It turns out that phoneme classification (whether by humans or by machines) is mostly dependent on the characteristics of the filter. Traditionally, speech recognizers estimate the filter characteristics and ignore the source. Many speech synthesis techniques use a source-filter model because it allows flexibility in altering the pitch and the filter. Many speech coders also use this model because it allows a low bit rate.
At first, the spectrogram is introduced as a representation of the speech signal that highlights several of its properties, with short-time Fourier analysis as the basic tool for building spectrograms. Then, this section presents several techniques used to separate source and filter: LPC and cepstral analysis, perceptually motivated models, formant tracking, and pitch tracking.
2.2.1 Short-time Fourier analysis
Speech processing research and technology are areas where the concept of a Fourier representation has traditionally played a major role. A spectrogram of a time signal is a special two-dimensional representation that displays time on its horizontal axis and frequency on its vertical axis. A gray scale is typically used to indicate the energy at each point (t, f), with white representing low energy and black high energy. In this section we cover short-time Fourier analysis, the basic tool with which to compute spectrograms.
Figure 2.4 (a) Waveform with (b) its corresponding wideband spectrogram
Darker areas mean higher energy for that time and frequency
The idea behind a spectrogram, such as that in Figure 2.4, is to compute a Fourier transform every 5 milliseconds or so, displaying the energy at each time/frequency point. However, the signal is no longer periodic when longer segments are analyzed, and therefore the exact definition of the Fourier transform cannot be used. Moreover, that definition requires knowledge of the signal for infinite time. For both reasons, a new set of techniques, called short-time analysis, is proposed. These techniques decompose the speech signal into a series of short segments, referred to as analysis frames, and analyze each one independently.
In Figure 2.4(a), note that within a short region the assumption that the signal can be approximated as periodic is reasonable. In other regions the signal is not periodic and looks like random noise, and different noise-like segments can exhibit different noisy characteristics. The use of an analysis frame implies that the region is short enough for the behavior (periodicity or noise-like appearance) of the signal to be approximately constant. If the region where speech seems periodic is too long, the pitch period is not constant and not all the periods in the region are similar. In essence, the speech region has to be short enough that the signal is stationary in that region: i.e., the signal characteristics (whether periodicity or noise-like appearance) are uniform in that region.
Similarly to a filter bank analysis, given a speech signal x[n], we define the short-time signal of frame m as

    x_m[n] = x[n] w[m − n]

where w[n] is a window function.
While the window function can have different values for different frames m, a popular choice is to keep it constant for all frames, with w[n] = 0 for |n| > N/2. In practice, the window length is on the order of 20 to 30 ms. With the above framework, the short-time Fourier representation for frame m is defined as

    X_m(e^{jω}) = Σ_{n=−∞}^{∞} x[n] w[m − n] e^{−jωn}    (2.3)
To interpret this, assume that the properties of x[n] persist outside the window, so that the signal is truly periodic with period M. In this case, its spectrum is a sum of impulses at the harmonics:

    X(e^{jω}) = Σ_k X[k] δ(ω − 2πk/M)

Given that the Fourier transform of w[m − n], viewed as a function of n, is W(e^{−jω}) e^{−jωm}, the convolution property implies that the transform of x[n] w[m − n] for fixed m is a convolution in the frequency domain: a sum of weighted copies of W(e^{jω}) shifted onto every harmonic,

    X_m(e^{jω}) = Σ_k X[k] W(e^{j(ω − 2πk/M)}) e^{−j(ω − 2πk/M)m}    (2.6)

The short-time spectrum of a periodic signal thus exhibits peaks (equally spaced 2π/M apart) representing the harmonics of the signal. We estimate X[k] from the short-time spectrum X_m(e^{jω}), and we see the importance of the length and choice of window.
Equation (2.6) indicates that one cannot recover X[k] by simply sampling the short-time spectrum, although the approximation can be reasonable if there is a small value Δ such that

    W(e^{jω}) ≈ 0 for |ω| > Δ    (2.7)

For example, a low-pitched voice with a fundamental frequency around 50 Hz requires a rectangular window of at least 20 ms and a Hamming window of at least 40 ms for the condition in Eq. (2.7) to be met. If speech is non-stationary within 40 ms, taking such a long window implies obtaining an average spectrum during that segment instead of several distinct spectra. For this reason, the rectangular window provides better time resolution than the Hamming window.
In practice, the Fourier transform in Eq. (2.3) is obtained through an FFT. If the window has length N, the FFT has to have a length greater than or equal to N. Since FFT algorithms often have lengths that are powers of 2 (L = 2^R), the windowed signal of length N is augmented with (L − N) zeros either before, after, or both. This process is called zero padding. A larger value of L provides a finer sampling of the discrete Fourier transform, but it does not increase the analysis frequency resolution: that is the sole mission of the window length N.
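To make the short-time analysis above concrete, here is a minimal sketch (an illustration added in editing, not tooling from the thesis; the frame length, hop size and FFT length are hypothetical choices) that computes a spectrogram with NumPy:

    import numpy as np

    def spectrogram(x, fs, win_ms=25.0, hop_ms=10.0):
        """Short-time Fourier analysis: Hamming-windowed frames and a
        zero-padded FFT. Returns log-energies in dB, one row per frame."""
        N = int(fs * win_ms / 1000)          # window length in samples
        hop = int(fs * hop_ms / 1000)        # frame shift in samples
        n_fft = 1 << (N - 1).bit_length()    # next power of 2 (zero padding)
        w = np.hamming(N)
        frames = []
        for start in range(0, len(x) - N + 1, hop):
            xm = x[start:start + N] * w                      # short-time signal x_m[n]
            X = np.fft.rfft(xm, n_fft)                       # short-time spectrum X_m
            frames.append(20 * np.log10(np.abs(X) + 1e-10))  # log-energy
        return np.array(frames)

    # Usage: half a second of a synthetic "voiced" signal at 8 kHz
    fs = 8000
    x = np.zeros(fs // 2)
    x[::80] = 1.0                 # impulse train, period M = 80 samples (100 Hz)
    S = spectrogram(x, fs)        # peaks appear at every harmonic of 100 Hz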
From the results of short-time Fourier analysis we can build spectrograms. The spectrogram is a useful speech signal representation in speech recognition and synthesis. Each phoneme is distinguished by its own unique pattern in the spectrogram. For voiced phonemes, the signature involves large concentrations of energy called formants; within each formant, and typically across all active formants, there is a characteristic waxing and waning of energy at all frequencies, which is the most salient characteristic of what we call the human voice. This cyclic pattern is caused by the repetitive opening and closing of the vocal cords, which occurs on average about 125 times per second in the average adult male, and approximately twice as fast (250 Hz) in the adult female, giving rise to the sensation of pitch.
Since the spectrogram displays just the energy and not the phase of the short-time Fourier transform, we compute the energy as in Eq. (2.8), with this value converted to a gray scale according to Figure 2.5. Pixels whose values have not been computed are interpolated. The slope controls the contrast of the spectrogram, while the saturation points for white and black control the dynamic range. Larger log-energies correspond to a darker gray color.

There are two main types of spectrograms: narrow-band and wide-band. Wide-band spectrograms use relatively short windows (< 10 ms) and thus have good time resolution at the expense of lower frequency resolution, since the corresponding filters have wide bandwidths (~ 200 Hz) and the harmonics cannot be seen. Spectrograms can aid in determining formant frequencies and fundamental frequency, as well as voiced and unvoiced regions.
2.2.2 Acoustic model of speech production
Speech is a sound wave created by vibration that is propagated in the air. Acoustic theory analyzes the laws of physics that govern the propagation of sound in the vocal tract. Such a theory should consider three-dimensional wave propagation, the variation of the vocal tract shape with time, losses due to heat conduction and viscous friction at the vocal tract walls, the softness of the tract walls, radiation of sound at the lips, nasal coupling and the excitation of sound. While a detailed model that considers all of the above is not yet available, some models provide a good approximation in practice, as well as a good understanding of the physics involved.
• Glottal Excitation
In the process of speech production, the vocal cords constrict the path from the lungs to the vocal tract. This is illustrated in Figure 2.6: the volume velocity is zero during the closed phase, during which the vocal cords are closed. As lung pressure is increased, air flows out of the lungs and through the opening between the vocal cords (glottis). At one point the vocal cords are together, thereby blocking the airflow, which builds up pressure behind them. Eventually the pressure reaches a level sufficient to force the vocal cords to open and thus allow air to flow through the glottis. Then, the pressure in the glottis falls and, if the tension in the vocal cords is properly adjusted, the reduced pressure allows the cords to come together, and the cycle is repeated. This condition of sustained oscillation occurs for voiced sounds. The closed phase of the oscillation takes place when the glottis is closed and the volume velocity is zero. The open phase is characterized by a non-zero volume velocity, in which the lungs and the vocal tract are coupled.
Figure 2.6 Glottal excitation
• Lossless Tube Concatenation
A widely used model for speech production is based on the assumption that the vocal tract can be represented as a concatenation of lossless tubes. The constant cross-sectional areas A_k of the tubes approximate the area function A(x) of the vocal tract. If a large number of tubes of short length are used, we can reasonably expect the frequency response of the concatenated tubes to be close to that of a tube with a continuously varying area function. For frequencies corresponding to wavelengths that are long compared to the dimensions of the vocal tract, it is reasonable to assume plane wave propagation along the axis of the tubes. If in addition we assume that there are no losses due to viscosity or thermal conduction, and that the area A does not change over time, the sound waves in the tube satisfy the following pair of equations:

    −∂p/∂x = (ρ/A) ∂u/∂t
    −∂u/∂x = (A/(ρc²)) ∂p/∂t    (2.9)
where p(x,t) is the sound pressure in the tube at position x and time t, u(x,t) is the volume velocity flow in the tube at position x and time t, ρ is the density of air in the tube, c is the velocity of sound and A is the cross-sectional area of the tube.
Figure 2.7 Approximation of a tube with continuously varying area A(x) as a concatenation of 5 lossless acoustic tubes
Since Eqs. (2.9) are linear, the pressure and volume velocity in tube k are related by

    u_k(x,t) = u_k⁺(t − x/c) − u_k⁻(t + x/c)
    p_k(x,t) = (ρc/A_k) [u_k⁺(t − x/c) + u_k⁻(t + x/c)]    (2.10)

where u_k⁺(t − x/c) and u_k⁻(t + x/c) are the traveling waves in the positive and negative directions, respectively, and x is the distance measured from the left-hand end of tube k, with 0 ≤ x ≤ l. The reader can prove that this is indeed the solution by substituting Eq. (2.10) into (2.9). When there is a junction between two tubes, as in Figure 2.8, part of the wave is transmitted and part of it is reflected.
Figure 2.8 Junction between two lossless tubes
A relationship between the z-transforms of the volume velocity at the glottis U_G(z) and at the lips U_L(z) for a concatenation of N lossless tubes can be derived using a discrete-time version of Eq. (2.10) and taking into account the boundary conditions at every junction.
In this relationship, A_{N+1} represents the equivalent area at the lips, and Z_G and Z_L are the equivalent impedances at the glottis and lips, respectively. Such impedances relate the volume velocity and the pressure; for the lips the expression is

    P_L(z) = U_L(z) Z_L(z)
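Although the junction equations themselves did not survive extraction here, the fraction of a wave reflected at each junction is commonly expressed through a reflection coefficient determined by the adjoining areas, r_k = (A_{k+1} − A_k)/(A_{k+1} + A_k). The sketch below is an illustration under that standard definition (the area values are hypothetical):

    import numpy as np

    def reflection_coefficients(areas):
        """Reflection coefficients r_k = (A_{k+1} - A_k) / (A_{k+1} + A_k)
        at the junctions of a concatenation of lossless tubes."""
        A = np.asarray(areas, dtype=float)
        return (A[1:] - A[:-1]) / (A[1:] + A[:-1])

    # Hypothetical 5-tube approximation of a vocal-tract area function (cm^2)
    areas = [2.0, 3.5, 6.0, 4.0, 1.5]
    r = reflection_coefficients(areas)   # each value is bounded between -1 and 1
    print(r)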
The transfer function of a concatenation of N lossless tubes has at most N/2 complex-conjugate pairs of poles, i.e., resonances or formants. These resonances occur when a given frequency gets trapped in the vocal tract because it is reflected back at the lips and then again back at the glottis. Since each tube has length l and there are N of them, the total length is L = Nl. The propagation delay in each tube is τ = l/c, and the sampling period is T = 2τ, the round-trip time in a tube. We can therefore find a relationship between the number of tubes N and the sampling frequency Fs = 1/T:

    N = 2 L Fs / c

For example, with Fs = 8000 Hz, L = 17 cm and c = 340 m/s, a concatenation of N = 8 tubes is needed.
The magnitude of the radiation impedance at the lips is 0 for low frequencies and reaches a constant value R asymptotically. This dependency upon frequency results in a reflection coefficient that is also a function of frequency. For low frequencies the reflection coefficient is close to 1 and no loss occurs. At higher frequencies, loss by radiation translates into a widening of the formant bandwidths.
In the production of the nasal consonants, the velum is lowered to couple the nasal tract to the pharynx, whereas a complete closure is formed in the oral tract (/m/ at the lips, /n/ just behind the teeth and /ng/ just forward of the velum itself). This configuration is shown in Figure 2.9, which shows two branches, one of them completely closed. For nasals, the radiation occurs primarily at the nostrils. The set of resonances is determined by the shape and length of the three tubes. At certain frequencies, the wave reflected in the closure cancels the wave at the pharynx, preventing energy from appearing at the nostrils. The result is that for nasal sounds, the vocal tract transfer function V(z) has anti-resonances (zeros) in addition to resonances. It has also been observed that nasal resonances have broader bandwidths than non-nasal voiced sounds, due to the greater viscous friction and thermal loss caused by the large surface area of the nasal cavity.
Figure 2.9 Coupling of the nasal cavity with the oral cavity
• Source-Filter Models of Speech Production
Speech signals are captured by microphones that respond to changes in air pressure. Thus, it is of interest to compute the pressure at the lips P_L(z), which can be obtained as

    P_L(z) = U_L(z) Z_L(z) = U_G(z) V(z) Z_L(z)    (2.17)

For voiced sounds we can model u_G[n] as an impulse train convolved with g[n], the glottal pulse. Since g[n] is of finite length, its z-transform is an all-zero system.
Figure 2.10 Model of the glottal excitation for voiced sounds
The complete model for both voiced and unvoiced sounds is shown in Figure 2.11.
Figure 2.11 General discrete-time model of speech production
The excitation can be either an impulse train with period T and amplitude A_v driving a filter G(z), or random noise with amplitude A_n.

We can simplify the model in Figure 2.11 by grouping G(z), V(z) and Z_L(z) into H(z) for voiced sounds, and V(z) and Z_L(z) into H(z) for unvoiced sounds. The simplified model is shown in Figure 2.12, where we make explicit the fact that the filter changes over time.
This model is a decent approximation, but it fails on voiced fricatives, since those sounds contain both a periodic component and an aspirated component. In this case, a mixed excitation model can be applied, using for voiced sounds a sum of both an impulse train and colored noise.
The model in Figure 2.13 is appealing because the source is white (it has a flat spectrum) and all the coloring is in the filter. Other source-filter decompositions attempt to model the source as the signal at the glottis, in which case the source is definitely not white. Since G(z) and Z_L(z) contain zeros, and V(z) can also contain zeros for nasals, H(z) is no longer all-pole.
Figure 2.13 A mixed excitation source-filter model of speech
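As a toy illustration of the simplified source-filter model in Figure 2.12 (my sketch; the resonance frequency, pole radius and pitch are hypothetical values, not taken from the thesis), one can drive a fixed all-pole filter with either an impulse train or white noise:

    import numpy as np
    from scipy.signal import lfilter

    fs = 8000                      # sampling frequency (Hz)
    n = fs // 2                    # half a second of samples

    # Source e[n]: impulse train (voiced) or white noise (unvoiced)
    voiced = np.zeros(n)
    voiced[::fs // 100] = 1.0      # pitch of 100 Hz
    unvoiced = np.random.randn(n)

    # Filter H(z) = 1/A(z): an all-pole resonator with a conjugate pole
    # pair near a hypothetical formant at 500 Hz
    r, f0 = 0.97, 500.0
    theta = 2 * np.pi * f0 / fs
    a = [1.0, -2 * r * np.cos(theta), r * r]      # denominator A(z)

    voiced_speech = lfilter([1.0], a, voiced)     # quasi-periodic output
    unvoiced_speech = lfilter([1.0], a, unvoiced) # noise-like output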
2.2.3 Linear predictive coding
A very powerful method for speech analysis is based on linear predictive coding (LPC). This method is widely used because it is a fast and simple, yet effective, way of estimating the main parameters of speech signals.
An all-pole filter with a sufficient number of poles is a good approximation for speech signals. Thus, we could model the filter H(z) in Figure 2.12 as

    H(z) = 1 / (1 − Σ_{k=1}^{p} a_k z^{−k})

Linear predictive coding gets its name from the fact that it predicts the current sample as a linear combination of its past p samples:

    x̃[n] = Σ_{k=1}^{p} a_k x[n − k]

The prediction error when using this approximation is

    e[n] = x[n] − x̃[n] = x[n] − Σ_{k=1}^{p} a_k x[n − k]    (2.22)
• The Orthogonality Principle
To estimate the predictor coefficients from a set of speech samples, we use the short-term analysis technique. Let us define x_m[n] as a segment of speech selected in the vicinity of sample m:

    x_m[n] = x[m + n]

Given the signal x_m[n], we estimate its corresponding LPC coefficients as those that minimize the total prediction error

    E_m = Σ_n e_m²[n]    (2.24)

Taking the derivative of Eq. (2.24) with respect to a_i and equating it to 0, we obtain

    Σ_n e_m[n] x_m[n − i] = 0,  1 ≤ i ≤ p

i.e., the prediction error is orthogonal to the past samples (Figure 2.14).
For convenience, we can define the correlation coefficients as

    φ_m[i, k] = Σ_n x_m[n − i] x_m[n − k]

so that the normal equations become

    Σ_{k=1}^{p} a_k φ_m[i, k] = φ_m[i, 0],  1 ≤ i ≤ p    (2.28)

Solution of this set of p linear equations results in the p LPC coefficients that minimize the prediction error. With a_k satisfying Eq. (2.28), the total prediction error in Eq. (2.24) takes on the following value:

    E_m = φ_m[0, 0] − Σ_{k=1}^{p} a_k φ_m[0, k]    (2.29)
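A hedged sketch of this procedure (not code from the thesis; the analysis order and the toy AR signal are assumptions for illustration) sets up the correlation coefficients and solves the normal equations directly:

    import numpy as np

    def lpc_covariance(x_m, p):
        """Solve the LPC normal equations sum_k a_k*phi[i,k] = phi[i,0]
        for a segment x_m, using covariance-style correlations."""
        N = len(x_m)
        # phi[i, k] = sum over n = p..N-1 of x_m[n-i] * x_m[n-k]
        phi = np.empty((p + 1, p + 1))
        for i in range(p + 1):
            for k in range(p + 1):
                phi[i, k] = np.dot(x_m[p - i:N - i], x_m[p - k:N - k])
        a = np.linalg.solve(phi[1:, 1:], phi[1:, 0])   # predictor coefficients
        error = phi[0, 0] - np.dot(a, phi[0, 1:])      # total prediction error
        return a, error

    # Usage on a toy AR(2) segment: x[n] = 0.6 x[n-1] - 0.3 x[n-2] + noise
    rng = np.random.default_rng(0)
    x = rng.standard_normal(400)
    for n in range(2, 400):
        x[n] += 0.6 * x[n - 1] - 0.3 * x[n - 2]
    a, E = lpc_covariance(x, p=2)                      # a is close to [0.6, -0.3]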
• Solution of the LPC Equations
There are three different algorithms for the solution of the LPC equations: the covariance method, the autocorrelation method and the lattice method. In this section, we present the autocorrelation method. Here the speech segment is windowed,

    x_m[n] = x[m + n] w[n]

with w[n] being a window (such as a Hamming window) which is 0 outside the interval 0 ≤ n < N. With this assumption, the corresponding prediction error e_m[n] is non-zero over the interval 0 ≤ n < N + p, and therefore the total prediction error takes on the value

    E_m = Σ_{n=0}^{N+p−1} e_m²[n]
The matrix in Eq. (2.39) is symmetric, and all the elements along each of its diagonals are identical. Such matrices are called Toeplitz. Durbin's recursion exploits this fact, resulting in a very efficient algorithm (for convenience, we omit the subscript m of the autocorrelation function).
The coefficients k_i, called reflection coefficients, are bounded between −1 and 1. In the process of computing the predictor coefficients of order p, the recursion finds the solution of the predictor coefficients for all orders less than p. Replacing R[j] by the normalized autocorrelation coefficients r[j], defined as

    r[j] = R[j] / R[0]

results in identical LPC coefficients, and the recursion is more robust to problems with arithmetic precision. Likewise, the normalized prediction error at iteration i is defined by dividing Eq. (2.29) by R[0], which, using Eq. (2.36), gives

    Ẽ^(i) = E^(i) / R[0] = ∏_{j=1}^{i} (1 − k_j²)
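Since the recursion equations themselves did not survive extraction, the following sketch restates the standard Levinson-Durbin recursion, with the reflection coefficients k_i and the shrinking prediction error made explicit (a reconstruction for illustration, not the thesis's own listing):

    import numpy as np

    def levinson_durbin(R, p):
        """Levinson-Durbin recursion: solve the Toeplitz normal equations.

        R : autocorrelation sequence R[0..p]
        Returns (a, k, E): predictor coefficients a[1..p], reflection
        coefficients k[1..p], and the final prediction error E.
        """
        a = np.zeros(p + 1)
        k = np.zeros(p + 1)
        E = R[0]
        for i in range(1, p + 1):
            # Reflection coefficient k_i from the previous-order solution
            k[i] = (R[i] - np.dot(a[1:i], R[i - 1:0:-1])) / E
            a_new = a.copy()
            a_new[i] = k[i]
            # Update lower-order coefficients: a_j <- a_j - k_i * a_{i-j}
            for j in range(1, i):
                a_new[j] = a[j] - k[i] * a[i - j]
            a = a_new
            E *= (1.0 - k[i] ** 2)     # error shrinks by (1 - k_i^2) each order
        return a[1:], k[1:], E

    # Usage with the autocorrelation of a windowed frame x_m
    x_m = np.hamming(240) * np.random.randn(240)
    p = 10
    R = np.array([np.dot(x_m[:len(x_m) - j], x_m[j:]) for j in range(p + 1)])
    a, k, E = levinson_durbin(R, p)    # |k_j| < 1 for a valid autocorrelation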
Using Eq. (2.22), we can compute the prediction error signal, also called the excitation or residual signal. For unvoiced speech synthetically generated as white noise driving an LPC filter, we expect the residual to be approximately white noise. In practice, this approximation is quite good, and replacement of the residual by white noise followed by the LPC filter typically results in no audible difference. For voiced speech synthetically generated as an impulse train driving an LPC filter, we expect the residual to approximate an impulse train. In practice, this is not the case, because the all-pole assumption is not altogether valid; thus the residual, although it contains spikes, is far from an impulse train. Replacing the residual by an impulse train, followed by the LPC filter, results in speech that sounds somewhat robotic, partly because real speech is not perfectly periodic (it has a random component as well), and partly because the zeros are not modeled with the LPC filter.
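A brief sketch of this inverse filtering (an illustration reusing the levinson_durbin helper sketched above; the frame and the order p = 10 are hypothetical): the residual is obtained by running the frame through the FIR inverse filter A(z) = 1 − Σ a_k z^{−k}, and passing it back through 1/A(z) recovers the frame.

    import numpy as np
    from scipy.signal import lfilter

    # x_m: a windowed speech frame; a: LPC coefficients from levinson_durbin
    x_m = np.hamming(240) * np.random.randn(240)   # stand-in frame
    p = 10
    R = np.array([np.dot(x_m[:len(x_m) - j], x_m[j:]) for j in range(p + 1)])
    a, k, E = levinson_durbin(R, p)

    # Inverse filter A(z) = 1 - sum_k a_k z^{-k} yields the residual e[n]
    e = lfilter(np.concatenate(([1.0], -a)), [1.0], x_m)

    # Resynthesis: passing e[n] back through 1/A(z) recovers x_m exactly
    x_rec = lfilter([1.0], np.concatenate(([1.0], -a)), e)
    assert np.allclose(x_rec, x_m)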
How do we choose p? This is an important design question. Unvoiced speech has higher prediction error than voiced speech, because the LPC model is more accurate for voiced speech. In general, the normalized error decreases rapidly and then flattens out for p of around 12 to 14 for 8 kHz speech. If we use a large value of p, we are fitting the individual harmonics, so the LPC filter is modeling the source, and the separation between source and filter is not going to be so good. Moreover, the more coefficients we have to estimate, the larger the variance of their estimates, since the number of available samples is the same. A rule of thumb is to use 1 complex pole per kHz plus 2 to 4 poles to model the radiation and glottal effects.
For unvoiced speech, both the autocorrelation and the covariance methods provide similar results. For voiced speech, however, the covariance method can provide better estimates if the analysis window is shorter than the local pitch period and the window only includes samples from the closed phase (when the vocal tract is closed at the glottis and the speech signal is due mainly to free resonances). This is called pitch-synchronous analysis and results in lower prediction error, because the true excitation is close to zero during the whole analysis window. During the open phase, the trachea, the vocal folds, and the vocal tract are acoustically coupled, and this coupling will change the free resonances. Additionally, the prediction error is higher for both the autocorrelation and the covariance methods if samples from the open phase are included in the analysis window, because the prediction during those instants is poor.
2.2.4 Cepstral processing

In this section we introduce the cepstrum as a homomorphic transformation, denoted by the operator D, that allows us to separate the source from the filter. We show that we can find a value N such that the cepstrum of the filter satisfies ĥ[n] ≈ 0 for n ≥ N and the cepstrum of the excitation satisfies ê[n] ≈ 0 for n < N. With this assumption, we can approximately recover both e[n] and h[n] from x̂[n] by homomorphic filtering. In Figure 2.15, we show how to recover h[n] with a cepstral window (lifter).
• The Real and Complex Cepstrum
The real cepstrum of a digital signal x[n] is defined as

    c[n] = (1/2π) ∫_{−π}^{π} ln |X(e^{jω})| e^{jωn} dω    (2.53)

and the complex cepstrum of x[n] is defined as

    x̂[n] = (1/2π) ∫_{−π}^{π} ln X(e^{jω}) e^{jωn} dω    (2.54)

You can see from Eqs. (2.53) and (2.54) that both the real and the complex cepstrum satisfy Eq. (2.50), and thus they are homomorphic transformations. If the signal x[n] is real, both the real cepstrum c[n] and the complex cepstrum x̂[n] are also real signals. Therefore, the term complex cepstrum doesn't mean that it is a complex signal, but rather that the complex logarithm is used in its computation.
From here on, when we refer to the cepstrum without qualifiers, we are referring to the real cepstrum, since it is the most widely used in speech technology. The cepstrum was invented by Bogert, and its term was coined by reversing the first syllable of the word spectrum, given that it is obtained by taking the inverse Fourier transform of the log-spectrum. Similarly, the term quefrency was defined to represent the independent variable n in c[n]. The quefrency has the dimension of time.
• Cepstrum of Speech Signals
We can compute the cepstrum of a speech segment by windowing the signal with a window of length N and applying the definition in Eq. (2.54) directly, using the DFT as follows:

    x̂_N[n] = (1/N) Σ_{k=0}^{N−1} ln X[k] e^{j2πkn/N}

The result x̂_N[n] is an aliased version of the true complex cepstrum. This aliasing introduces errors in the estimation, which can be reduced by choosing a large value for N. Computation of the complex cepstrum requires
computing the complex logarithm and, in turn, the phase. However, given the principal value of the phase θ_p[k], there are infinitely many possible values for θ[k]. If x[n] is real, arg[X(e^{jω})] is an odd and continuous function, so we can do phase unwrapping by choosing θ[k] to be a smooth function, i.e., by forcing the difference between adjacent values to be small (adding the appropriate multiple of 2π where needed).
For unvoiced speech, the unwrapped phase is random, and therefore only the real cepstrum has meaning. In practical situations, even voiced speech has some frequencies at which noise dominates (typically very low and high frequencies), which results in a phase θ[k] that changes drastically from frame to frame. Because of this, the complex cepstrum in Eq. (2.54) is rarely used for real speech signals; instead, the real cepstrum is used much more often.
Similarly, it can be shown that the real cepstrum computed through the DFT, c_N[n], is an aliased version of c[n], given by c_N[n] = Σ_r c[n + rN], which again has aliasing that can be reduced by choosing a large value for N.
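A minimal sketch of the DFT-based real cepstrum just described (an illustration; the window and FFT lengths are hypothetical, and a small constant guards the logarithm):

    import numpy as np

    def real_cepstrum(frame, n_fft=1024):
        """Real cepstrum via the DFT: inverse FFT of the log magnitude
        spectrum. A large n_fft reduces the aliasing discussed above."""
        X = np.fft.fft(frame, n_fft)
        log_mag = np.log(np.abs(X) + 1e-12)   # ln |X[k]|, guarded against log(0)
        c = np.fft.ifft(log_mag).real         # c_N[n]; imaginary part is ~0
        return c

    # Low quefrencies describe the filter (spectral envelope); for voiced
    # frames a strong peak appears at the pitch period (the excitation).
    frame = np.hamming(400) * np.random.randn(400)
    c = real_cepstrum(frame)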
2.2.5 Perceptually-motivated representations
In this section we describe some aspects of human perception and a method motivated by the behavior of the human auditory system: Mel-Frequency Cepstrum Coefficients (MFCC). This method has been successfully used in speech recognition. First we present several nonlinear frequency scales that have been used in such representations.
• The Bilinear Transform
The transformation

    ẑ⁻¹ = (z⁻¹ − α) / (1 − α z⁻¹)

for 0 < α < 1 belongs to the class of bilinear transforms. It is a mapping in the complex plane that maps the unit circle onto itself. The frequency transformation is obtained by making the substitutions z = e^{jω} and ẑ = e^{jω̃}, which yields the warped frequency

    ω̃ = ω + 2 arctan( α sin ω / (1 − α cos ω) )
Oppenheim showed that the advantage of this transformation is that it can be used to transform a time sequence in the linear-frequency domain into another time sequence in the warped-frequency domain. This bilinear transform has been successfully applied to cepstral and autocorrelation coefficients, both of which are causal sequences. The input is the time-reversed cepstrum sequence, and the output can be obtained by sampling the outputs of the filters at time n = 0; the filters used for stages m ≥ 2 are the same. Note that, for a finite-length cepstrum, an infinite-length warped cepstrum results.
For a finite number of cepstral coefficients, the bilinear transform in Figure 2.16 results in an infinite number of warped cepstral coefficients. Since truncation is usually done in practice, the bilinear transform is equivalent to a matrix multiplication, where the matrix is a function of the warping parameter.
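The warping function itself is easy to evaluate; the sketch below (an illustration with a hypothetical warping parameter α, not a value from the thesis) computes the warped frequency for the bilinear transform above:

    import numpy as np

    def bilinear_warp(omega, alpha):
        """Warped frequency of the all-pass bilinear transform:
        omega' = omega + 2*arctan(alpha*sin(omega) / (1 - alpha*cos(omega)))."""
        return omega + 2 * np.arctan(alpha * np.sin(omega) /
                                     (1 - alpha * np.cos(omega)))

    # A hypothetical warping parameter; larger alpha compresses the
    # high frequencies more strongly, roughly mimicking auditory scales.
    omega = np.linspace(0, np.pi, 256)
    warped = bilinear_warp(omega, alpha=0.45)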
The Mel-Frequency Cepstrum Coefficients (MFCC) representation is defined as the real cepstrum of a windowed short-time signal derived from the FFT of that signal. The difference from the real cepstrum is that a nonlinear frequency scale is used, which approximates the behavior of the auditory system. The MFCC representation is beneficial for speech recognition. Starting from the DFT of the input signal, a bank of triangular filters spaced according to the mel scale (Figure 2.17) is applied to the magnitude spectrum, the logarithm of each filter's energy is taken, and the cepstral coefficients are obtained from these log-energies.
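A compact sketch of that pipeline (my reconstruction of a standard MFCC computation for illustration, not the thesis's code; the filter count, FFT size and number of coefficients are hypothetical):

    import numpy as np

    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)

    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    def mfcc(frame, fs, n_filters=24, n_fft=512, n_ceps=13):
        """MFCC: |DFT| -> triangular mel filterbank -> log -> cosine
        transform. Returns the first n_ceps coefficients."""
        spec = np.abs(np.fft.rfft(frame, n_fft))          # magnitude spectrum
        # Triangular filters with centers equally spaced on the mel scale
        mel_pts = np.linspace(0, hz_to_mel(fs / 2), n_filters + 2)
        bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
        fbank = np.zeros((n_filters, n_fft // 2 + 1))
        for i in range(1, n_filters + 1):
            l, c, r = bins[i - 1], bins[i], bins[i + 1]
            fbank[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)  # rising edge
            fbank[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)  # falling edge
        log_e = np.log(fbank @ spec + 1e-12)              # log filterbank energies
        # Type-II DCT of the log-energies gives the mel cepstrum
        n = np.arange(n_filters)
        dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (n + 0.5)) / n_filters)
        return dct @ log_e

    coeffs = mfcc(np.hamming(400) * np.random.randn(400), fs=16000)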