
HANOI UNIVERSITY OF TECHNOLOGY

THESIS FOR THE DEGREE OF MASTER

OF SCIENCE

STUDY AND DESIGN A PROCEDURE FOR

BUILDING SPEECH CORPORA FOR

MINORITY LANGUAGES IN VIETNAM

ĐOÀN THỊ NGỌC HIỀN

Supervisor: Dr. ERIC CASTELLI

HA NOI 2005


For the Degree of

MASTER OF INFORMATION PROCESSING AND COMMUNICATION


Acknowledgments

During the course of my thesis work, there were many people who were instrumental in helping me. I would like to take this opportunity to acknowledge some of them.

Firstly, I would like to express my gratitude to my supervisor, Dr. Eric Castelli, whose expertise, understanding, patience, and constructively critical eye added considerably to my graduate experience.

Special thanks go out to Dr. Nguyen Trong Giang and Dr. Pham Thi Ngoc Yen for providing me with the best possible working conditions during my time at the International Research Center MICA.

I would like to thank Mr. Tran Do Dat, who has a great deal of experience in building speech corpus databases, for his helpful advice throughout the research and the recording of the speech corpus.

I would also like to thank my family, especially my parents, for the support they provided me through my entire life; without their care and encouragement I would not have finished this thesis.

Finally, thanks go to all of my colleagues who helped me while I worked on this thesis.


2.2 Speech signal representations
2.2.3 Linear predictive coding
3.2 The program of management of the VNSpeechCorpus
3.2.1 Study of SAM standard
3.2.3 Conversion of SAM signal into WAV signal


3.3.1 Large Vocabulary Continuous Speech Recognition system for Vietnamese
3.3.2 Vietnamese Speech Synthesis
CHAPTER 4. THE VIETNAMESE MINORITY LANGUAGES
CHAPTER 5. THE SPEECH CORPUS AND THE ADAPTIVE TECHNIQUES FOR RECORDING THE MINORITY CORPUS


List of Figures

Figure 2.1 Schematic diagram of the human vocal mechanism
Figure 2.2 Block diagram of human speech production
Figure 2.3 Basic source-filter model for speech signals
Figure 2.4 (a) Waveform with (b) its corresponding wideband spectrogram; darker areas mean higher energy at that time and frequency
Figure 2.5 Conversion between log-energy values (on the x-axis) and gray scale (on the y-axis)
Figure 2.7 Approximation of a tube with continuously varying area A(x) as a concatenation of lossless tubes
Figure 2.8 Junction between two lossless tubes
Figure 2.9 Coupling of the nasal cavity with the oral cavity
Figure 2.10 Model of the glottal excitation for voiced sounds
Figure 2.11 General discrete-time model of speech production
Figure 2.12 Source-filter model for voiced and unvoiced speech
Figure 2.13 A mixed excitation source-filter model of speech
Figure 2.14 The orthogonality principle: the prediction error is orthogonal to the past samples
Figure 2.17 Triangular filters used in the computation of the mel-cepstrum
Figure 3.1 The structure of the VNSpeechCorpus
Figure 3.2 Description of the nomenclature of the files in the SAM standard
Figure 3.3 Example of a file name of description of corpus
Figure 3.4 The process of building the speech database
Figure 3.5 The relation between tables of the speech database
Figure 3.6 The interface of the VNSpeechCorpus
Figure 3.7 The result of search by word and type of corpus


Figure 3.8 Table of Linear Predictive Coding Coefficients
Figure 3.9 Table of Mel-Frequency Coefficients
Figure 4.1 Austro-Asiatic family graph
Figure 4.2 Austronesian family graph
Figure 4.3 Tai-Kadai family graph
Figure 4.4 Miao-Yao family graph
Figure 4.5 Sino-Tibetan family graph
Figure 5.1 Portable Minidisc Recorder SONY Walkman MZ-N707
Figure 5.2 Sound Blaster Audigy 2 ZS Notebook
Figure 5.3 USBPre Microphone Interface for Computer Audio Recording
Figure 5.4 The waveform and spectrogram of the sentence “Phôngv na t ix”


Abstract

In recent years, thanks to research in the speech processing field, scientists have obtained many considerable results, especially in speech recognition and synthesis, and these are applied in many different fields of life, such as speech-based accessibility systems. For example, there has been much work on speech-based and auditory interfaces that allow visually impaired users to access existing graphical interfaces. In general, multiple modalities have been used to make human-computer interaction accessible for people with disabilities. Since the 1990s, many speech databases have been built around the world for studying all aspects of speech, such as SpeechDat, SALA II and SPEECON. In Vietnam, speech processing has been researched in recent years, and the International Research Center MICA (Multimedia Information, Communication and Applications) has built a large Vietnamese speech database (called the VNSpeechCorpus) comprising about 100 recorded hours with at least 50 speakers in different recording environments. However, while many majority-language corpora have been created and made available recently, less progress has been made in the creation of minority-language resources. Recognizing this problem, and building on the experience gained in constructing the VNSpeechCorpus, we aim to design a procedure for building speech corpora for the minority languages of Vietnam. The speech material is chosen to characterize the vowels and consonants, and the measured parameters will be the spectral characteristics of the words (formants for the vowels, fundamental frequency for the tones, etc.).


Chapter 1 INTRODUCTION

So far, many projects have built speech corpora for majority languages, such as ATR-JSDB, SpeechDat and SALA II. However, the procedure for building a speech corpus for a majority language cannot be applied directly to minority languages, because of the different characteristics of majority and minority languages and the residential areas of the minorities. Therefore, the study and design of a procedure for building speech corpora for minority languages in Vietnam is the objective of this thesis.

The thesis rests on the following three bases. Firstly, the Vietnamese speech database that has been built at the International Research Center MICA, Hanoi University of Technology, Vietnam. Secondly, research on the history and characteristics of the minority languages in Vietnam. And the final basis is the speech corpora that have been built for some minority languages elsewhere in the world.

To reach the objective of the thesis, we have to deal with four main problems. The first is to study the procedure used to build the Vietnamese speech database (called the VNSpeechCorpus). The second is to design a program for managing the VNSpeechCorpus, because the first phase of building the VNSpeechCorpus stopped at the recording stage. The next is to design a procedure for building speech corpora for minority languages. And last but not least is to experiment with the new procedure on a minority language in Vietnam.

This thesis is organized as follows. Chapter 2 gives an overview of speech signals and their representations. Chapter 3 discusses the building of a Vietnamese speech corpus. The languages of the minorities in Vietnam are studied in Chapter 4. Chapter 5 will introduce the adaptive techniques for


recording the corpus of minority languages. Chapter 6 gives the conclusions and presents a summary of the work done and of future work.


Chapter 2 SPEECH SIGNAL AND REPRESENTATIONS

2.1 Speech signal

2.1.1 Introduction

In considering the process of speech communication, it is helpful to begin by thinking of a message represented in some abstract form in the brain of the speaker. Through the complex process of producing speech, the information in that message is ultimately converted to an acoustic signal. The message information can be thought of as being represented in a number of different ways in the process of speech production. For example, the message information is first converted into a set of neural signals, which control the articulatory mechanism (that is, the motions of the tongue, lips, vocal cords, etc.). The articulators move in response to these neural signals to perform a sequence of gestures, the end result of which is an acoustic waveform that contains the information in the original message.

The information that is communicated through speech is intrinsically of a discrete nature; it can be represented by a concatenation of elements from a finite set of symbols. The symbols from which every sound can be classified are called phonemes. Each language has its own distinctive set of phonemes. For example, Vietnamese can be represented by a set of 46 phonemes.

After the speech signal is generated and propagated to the listener, the speech perception process begins. Firstly, the listener processes the acoustic signal along the basilar membrane in the inner ear, which provides a running speech analysis of the incoming signal. A neural transduction process converts the spectral signal at the output of the basilar membrane into activity signals on the auditory nerve, corresponding roughly to a feature extraction


process. In a manner that is not well understood, the neural activity along the auditory nerve is converted into a language code at the higher centers of processing within the brain, and finally message comprehension is achieved.

2.1.2 Speech production process

Figure 2.1 shows the important features of the human vocal system. The main components of the system are the lungs, trachea (windpipe), larynx (the organ of voice production), pharyngeal cavity (throat), oral or buccal cavity (mouth), and nasal cavity (nose).

Figure 2.1 Schematic diagram of the human vocal mechanism


The vocal tract begins at the opening between the vocal cords, or glottis, and ends at the lips; it consists of the pharynx and the mouth or oral cavity. In the average male, the total length of the vocal tract is about 17 cm. The cross-sectional area of the vocal tract, determined by the positions of the tongue, lips, jaw and velum, varies from zero to about 20 cm². The nasal tract begins at the velum and ends at the nostrils. When the velum is lowered, the nasal tract is acoustically coupled to the vocal tract to produce the nasal sounds of speech [1].

In studying the speech production process, it is helpful to abstract the important features of the physical system in a manner that leads to a realistic yet tractable mathematical model. A block diagram of human speech production is shown in Figure 2.2. In this diagram, the lungs and trachea are considered a source of energy for the production of speech. Speech is simply the acoustic wave that is radiated from this source when air is expelled from the lungs and the resulting flow of air is perturbed by a constriction somewhere in the vocal tract.


Figure 2.2 Block diagram of human speech production

Speech sounds can be classified into three distinct classes according to their mode of excitation. Voiced sounds are produced by forcing air through the glottis with the tension of the vocal cords adjusted so that they vibrate in a relaxation oscillation, thereby producing quasi-periodic pulses of air which excite the vocal tract. Fricative or unvoiced sounds are generated by forming a constriction at some point in the vocal tract and forcing air through the constriction at a velocity high enough to produce turbulence. This creates a broad-spectrum noise source to excite the vocal tract. Plosive sounds result from making a complete closure, building up pressure behind the closure, and abruptly releasing it.


2.2 Speech signal representations

This section presents several representations for speech signals useful

in speech coding, synthesis and recognition The central theme is the

decomposition of the speech signal as a source passed through a lincar time-

varying filter This filter can be derived from models of speech production based on the theory of acoustics where the source represents the airflow at the

vocal cords, and the filter represents the resonances of the vocal tract, which

change over time Such a source-filter model is illustrated in Figure 2.3 We

describe methods ta compute both the source or excitation e[n] and the filter

A[n] from the speech signal x[7]

well as speech perception models (such as Mel-frequency cepstrum) Once

the filter has been estimated, the source can be obtained by passing the speech

signal through the inverse filler Scparalion between sourec and filler is one of

the most difficult challenges in speech processing lt turns out that phoneme

classification (either by human or by machines) is mostly dependent on the

characteristics of the filter ‘lraditionally, speech recognizers estimate the

filter characteristics and ignore the source Many speech synthesis techniques

use 4 source filler model because it allows {Mexibility in allering the pitch and

the filter Many speech coders also use this model because it allows a low bit

rate,

First, the spectrogram is introduced as a representation of the speech signal that highlights several of its properties, and short-time Fourier


analysis is presented as the basic tool used to build spectrograms. Then, this section presents several techniques used to separate source and filter: LPC and cepstral analysis, perceptually motivated models, formant tracking, and pitch tracking.

2.2.1 Short-time Fourier analysis

Speech processing research and technology are areas where the concept of a Fourier representation has traditionally played a major role. As we know, a spectrogram of a time signal is a special two-dimensional representation that displays time on its horizontal axis and frequency on its vertical axis. A gray scale is typically used to indicate the energy at each point (t, f), with white representing low energy and black high energy. In this section we cover short-time Fourier analysis, the basic tool with which to compute them.


Figure 2.4 (a) Waveform with (b) its corresponding wideband spectrogram

Darker areas mean higher energy for that time and frequency


The idea behind a spectrogram, such as that in Figure 2.4, is to compute a Fourier transform every 5 milliseconds or so, displaying the energy at each time/frequency point. However, the signal is no longer periodic when longer segments are analyzed, and therefore the exact definition of the Fourier transform cannot be used. Moreover, that definition requires knowledge of the signal for infinite time. For both reasons, a new set of techniques, called short-time analysis, is proposed. These techniques decompose the speech signal into a series of short segments, referred to as analysis frames, and analyze each one independently.

In Figure 2.4(a), note that the assumption that the signal can be approximated as periodic within V and Y is reasonable. In regions (A, W) and (A, G), the signal is not periodic and looks like random noise, and the signal in (A, W) appears to have different noisy characteristics from those of segment (A, G). The use of an analysis frame implies that the region is short enough for the behavior (periodicity or noise-like appearance) of the signal to be approximately constant. If the region where speech seems periodic is too long, the pitch period is not constant and not all the periods in the region are similar. In essence, the speech region has to be short enough that the signal is stationary in that region, i.e., so that the signal characteristics (whether periodicity or noise-like appearance) are uniform in that region.

Similarly to filter banks, given a speech signal x[n], we define the short-time signal of frame m as

$$x_m[n] = w[m-n]\, x[n]$$


While the window function can have different values for different frames m, a popular choice is to keep it constant for all frames:

$$w_m[n] = w[m-n]$$

where w[n] = 0 for |n| > N/2. In practice, the window length is on the order of 20 to 30 ms. With the above framework, the short-time Fourier representation for frame m is defined as

$$X_m(e^{j\omega}) = \sum_{n=-\infty}^{\infty} w[m-n]\, x[n]\, e^{-j\omega n} \qquad (2.3)$$

To interpret this, assume that the properties of x_m[n] persist outside the window and that, therefore, the signal is periodic with period M in the true sense. In this case, its spectrum is a sum of impulses:

$$X(e^{j\omega}) = \sum_{k} X_m[k]\, \delta\!\left(\omega - \frac{2\pi k}{M}\right)$$

Given the Fourier transform W(e^{jω}) of w[n], the transform of w[m − n] is e^{-jωm} W(e^{-jω}). Therefore, using the convolution property, the transform of x[n] w[m − n] for fixed m is a convolution in the frequency domain, which is a sum of weighted copies of W(e^{jω}), shifted onto every harmonic:

$$X_m(e^{j\omega}) = \sum_{k} X_m[k]\, e^{-j(\omega - 2\pi k/M)m}\, W\!\left(e^{j(\omega - 2\pi k/M)}\right) \qquad (2.6)$$

The short-time spectrum of a periodic signal thus exhibits peaks (equally spaced 2π/M apart) representing the harmonics of the signal. We estimate X_m[k] from the short-time spectrum X_m(e^{jω}), and we see the importance of the length and choice of window.
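To make the windowing and transform steps concrete, the following sketch computes the short-time spectrum of a single frame. It is an illustration only, written in Python with NumPy (the thesis itself contains no code); the window length and the toy signal are assumptions.

```python
import numpy as np

def short_time_spectrum(x, m, N, fs):
    """Short-time Fourier representation of the frame centered at sample m.

    x  : 1-D speech signal
    N  : window length in samples (20-30 ms worth, as the text suggests)
    fs : sampling frequency in Hz
    """
    w = np.hamming(N)                    # w[n], zero outside the window
    start = m - N // 2                   # frame implements x_m[n] = w[m-n] x[n]
    frame = x[start:start + N] * w
    X = np.fft.rfft(frame)               # one-sided DFT of the windowed frame
    freqs = np.fft.rfftfreq(N, d=1.0 / fs)
    return freqs, X

# Toy usage: a quasi-periodic "voiced" signal with a 100 Hz fundamental
fs = 8000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 100 * t) + 0.3 * np.sin(2 * np.pi * 200 * t)
freqs, X = short_time_spectrum(x, m=4000, N=240, fs=fs)   # 30 ms window
```

For a periodic input such as this one, peaks appear at the harmonics (100 Hz and 200 Hz), each smeared by the window transform W(e^{jω}), exactly as described above.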


Equation (2.6) indicates that one cannot recover X_m[k] by simply sampling the short-time spectrum, although the approximation can be reasonable if there is a small value of Δ such that

$$W(e^{j\omega}) \approx 0 \quad \text{for } |\omega| > \Delta \qquad (2.7)$$

A pitched voice with a fundamental of 100 Hz requires a rectangular window of at least 20 ms and a Hamming window of at least 40 ms for the condition in Eq. (2.7) to be met. If speech is non-stationary within 40 ms, taking such a long window implies obtaining an average spectrum during that segment instead of several distinct spectra. For this reason, the rectangular window provides better time resolution than the Hamming window.

In practice, the Fourier transform in Eq. (2.3) is obtained through an FFT. If the window has length N, the FFT has to have a length greater than or equal to N. Since FFT algorithms often have lengths that are powers of 2 (L = 2^k), the windowed signal of length N is augmented with (L − N) zeros, either before, after, or both. This process is called zero padding. A larger value of L provides a finer sampling of the discrete Fourier transform, but it does not increase the analysis frequency resolution: that is the sole mission of the window length N.
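A minimal sketch of the zero-padding step (same assumed Python/NumPy setting as above): padding a length-N windowed frame to the next power-of-two FFT length L samples the same spectrum more finely without changing the resolution set by N.

```python
import numpy as np

N = 240                                   # 30 ms window at 8 kHz
frame = np.random.randn(N) * np.hamming(N)

L = 1 << (N - 1).bit_length()             # next power of two >= N (256 here)
X_padded = np.fft.rfft(frame, n=L)        # rfft appends the (L - N) zeros itself

# More finely sampled points on the same spectral envelope; the main-lobe
# width (the true frequency resolution) is still set by the window length N.
X_plain = np.fft.rfft(frame)
```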

From the results of short-time Fourier analysis we can build spectrograms. The spectrogram is a useful speech signal representation in speech recognition and synthesis. Each phoneme is distinguished by its own unique pattern in the spectrogram. For voiced phonemes, the signature involves large


concentrations of energy called formants; within each formant, and typically across all active formants, there is a characteristic waxing and waning of energy at all frequencies, which is the most salient characteristic of what we call the human voice. This cyclic pattern is caused by the repetitive opening and closing of the vocal cords, which occurs on average 125 times per second in the average adult male and approximately twice as fast (250 Hz) in the adult female, giving rise to the sensation of pitch.

Since the spectrogram displays just the energy and not the phase of the short-time Fourier transform, we compute the energy as in Eq. (2.8), with this value converted to a gray scale according to Figure 2.5. Pixels whose values have not been computed are interpolated. The slope controls the contrast of the spectrogram, while the saturation points for white and black control the dynamic range. Larger log-energies correspond to a darker gray color.
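As a sketch of that gray-scale conversion (the saturation points below are made-up values, since Figure 2.5 is not reproduced here):

```python
import numpy as np

def log_energy_to_gray(log_e, black=-30.0, white=-100.0):
    """Map log-energies (dB) linearly onto a gray scale in [0, 1].

    Values at or below `white` saturate to 1.0 (white = low energy) and
    values at or above `black` saturate to 0.0 (black = high energy);
    the slope of the line in between controls the contrast, and the two
    saturation points control the dynamic range, as described above.
    """
    gray = (log_e - black) / (white - black)
    return np.clip(gray, 0.0, 1.0)
```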

There are two main types of spectrograms: narrow-band and wide-band. Wide-band spectrograms use relatively short windows (< 10 ms) and thus have good time resolution at the expense of lower frequency resolution, since the corresponding filters have wide bandwidths (~ 200 Hz) and the


harmonics cannot be seen. Spectrograms can aid in determining formant frequencies and fundamental frequency, as well as voiced and unvoiced regions.
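A small sketch of how the two spectrogram types differ only in window length (the hop sizes and other parameter values are hypothetical, not taken from the text):

```python
import numpy as np

def spectrogram(x, fs, win_ms, hop_ms):
    """Magnitude spectrogram with a Hamming window of win_ms milliseconds."""
    N = int(fs * win_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    w = np.hamming(N)
    frames = [x[i:i + N] * w for i in range(0, len(x) - N, hop)]
    return np.abs(np.fft.rfft(frames, axis=-1))

fs = 16000
x = np.random.randn(fs)                              # stand-in for 1 s of speech
S_wide = spectrogram(x, fs, win_ms=6, hop_ms=1)      # < 10 ms: wideband
S_narrow = spectrogram(x, fs, win_ms=30, hop_ms=10)  # resolves the harmonics
```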

2.2.2 Acoustic model of speech production

Speech is a sound wave created by vibration and propagated in the air. Acoustic theory analyzes the laws of physics that govern the propagation of sound in the vocal tract. Such a theory should consider three-dimensional wave propagation, the variation of the vocal tract shape with time, losses due to heat conduction and viscous friction at the vocal tract walls, the softness of the tract walls, radiation of sound at the lips, nasal coupling, and excitation of sound. While a detailed model that considers all of the above is not yet available, some models provide a good approximation in practice, as well as a good understanding of the physics involved.

• Glottal Excitation

In the process of speech production, the vocal cords constrict the path from the lungs to the vocal tract. This is illustrated in Figure 2.6: the volume velocity is zero during the closed phase, during which the vocal cords are closed. As lung pressure is increased, air flows out of the lungs and through the opening between the vocal cords (the glottis). At one point the vocal cords come together, thereby blocking the airflow, which builds up pressure behind them. Eventually the pressure reaches a level sufficient to force the vocal cords to open and thus allow air to flow through the glottis. Then the pressure in the glottis falls and, if the tension in the vocal cords is properly adjusted, the reduced pressure allows the cords to come together, and the cycle is repeated. This condition of sustained oscillation occurs for voiced sounds. The closed phase of the oscillation takes place when the glottis is closed and the volume


velocity is zero. The open phase is characterized by a non-zero volume velocity, in which the lungs and the vocal tract are coupled.


Figure 2.6 Glottal excitation

• Lossless Tube Concatenation

A widely used model for speech production is based on the assumption that the vocal tract can be represented as a concatenation of lossless tubes. The constant cross-sectional areas {A_k} of the tubes approximate the area function A(x) of the vocal tract. If a large number of tubes of short length are used, we can reasonably expect the frequency response of the concatenated tubes to be close to that of a tube with continuously varying area function. For frequencies corresponding to wavelengths that are long compared to the dimensions of the vocal tract, it is reasonable to assume plane wave propagation along the axis of the tubes. If, in addition, we assume that there are no losses due to viscosity or thermal conduction and that the area A does not change over time, the sound waves in the tube satisfy the following pair of equations:

$$-\frac{\partial p}{\partial x} = \frac{\rho}{A}\,\frac{\partial u}{\partial t}, \qquad -\frac{\partial u}{\partial x} = \frac{A}{\rho c^{2}}\,\frac{\partial p}{\partial t} \qquad (2.9)$$

where p(x, t) is the sound pressure in the tube at position x and time t, u(x, t) is the volume velocity flow in the tube at position x and time t, ρ is the


density of air in the tube, c is the velocity of sound, and A is the cross-sectional area of the tube.

Figure 2.7 Approximation of a tube with continuously varying area A(x) as a concatenation of 5 lossless acoustic tubes

Since Eqs. (2.9) are linear, the pressure and volume velocity in tube k are related by

$$u_k(x,t) = u_k^{+}(t - x/c) - u_k^{-}(t + x/c)$$
$$p_k(x,t) = \frac{\rho c}{A_k}\left[\,u_k^{+}(t - x/c) + u_k^{-}(t + x/c)\,\right] \qquad (2.10)$$

where u_k^{+}(t − x/c) and u_k^{−}(t + x/c) are the traveling waves in the positive and negative directions respectively, and x is the distance measured from the left-hand end of tube k, 0 ≤ x ≤ l. The reader can prove that this is indeed the solution by substituting Eq. (2.10) into Eq. (2.9).

When there is a junction between two tubes, as in Figure 2.8, part of the wave is transmitted into the next tube and part of it is reflected back.


Figure 2.8 Junction between two lossless tubes

A relationship between the z-transforms of the volume velocity at the glottis U_G(z) and at the lips U_L(z) for a concatenation of N lossless tubes can be derived using a discrete-time version of Eq. (2.10), taking into account the boundary conditions at every junction. These boundary conditions involve the equivalent area at the lips and the equivalent impedances Z_G and Z_L at the glottis and lips, respectively. Such impedances relate the volume velocity and the pressure; for the lips the expression is

$$P_L(z) = Z_L(z)\, U_L(z)$$
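At each junction, the fraction of the wave that is reflected is governed by a reflection coefficient. A brief sketch under the standard lossless-tube convention r_k = (A_{k+1} − A_k)/(A_{k+1} + A_k), with made-up areas (the thesis does not give numerical values):

```python
import numpy as np

def reflection_coefficients(areas):
    """Reflection coefficients at the junctions of concatenated lossless tubes.

    areas : cross-sectional areas A_1 ... A_N from glottis to lips.
    Uses the standard convention r_k = (A_{k+1} - A_k) / (A_{k+1} + A_k),
    so each coefficient lies between -1 and 1.
    """
    A = np.asarray(areas, dtype=float)
    return (A[1:] - A[:-1]) / (A[1:] + A[:-1])

# Five-tube approximation of a vowel-like area function (illustrative, cm^2)
r = reflection_coefficients([2.0, 4.5, 8.0, 6.0, 3.0])
```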


The resulting transfer function has at most N/2 complex conjugate poles, i.e. resonances or formants. These resonances occur when a given frequency gets trapped in the vocal tract because it is reflected back at the lips and then again back at the glottis. Since each tube has length l and there are N of them, the total length is L = Nl. The propagation delay in each tube is τ = l/c, and the sampling period is T = 2τ, the round trip in one tube. We can find a relationship between the number of tubes N and the sampling frequency F_s = 1/T:

$$N = \frac{2\,L\,F_s}{c}$$
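As a quick worked example (taking c ≈ 340 m/s, a common textbook value): with a vocal-tract length L = 17 cm and a sampling frequency F_s = 8 kHz, the relation above gives N = 2 × 0.17 × 8000 / 340 = 8 tubes.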

The resistive part of the radiation impedance at the lips is 0 for low frequencies and reaches R_r asymptotically. This dependency upon frequency results in a reflection coefficient that is also a function of frequency. For low frequencies the reflection coefficient is close to 1, and no loss occurs. At higher frequencies, loss by radiation translates into a widening of the formant bandwidths.

In the production of the nasal consonants, the velum is lowered to couple the nasal tract to the pharynx, while a complete closure is formed in the oral tract (/m/ at the lips, /n/ just behind the teeth, and /ng/ just forward of the velum itself). This configuration is shown in Figure 2.9, which shows two branches, one of them completely closed. For nasals, the radiation occurs primarily at the nostrils. The set of resonances is determined by the shape and length of the three tubes. At certain frequencies, the wave reflected at the closure cancels the wave at the pharynx, preventing energy from appearing at


the nostrils. The result is that for nasal sounds, the vocal tract transfer function V(z) has anti-resonances (zeros) in addition to resonances. It has also been observed that nasal resonances have broader bandwidths than those of non-nasal voiced sounds, due to the greater viscous friction and thermal loss caused by the large surface area of the nasal cavity.


Figure 2.9 Coupling of the nasal cavity with the oral cavity

• Source-Filter Models of Speech Production

Speech signals are captured by microphones that respond to changes in air pressure. Thus, it is of interest to compute the pressure at the lips P_L(z), which can be obtained as

$$P_L(z) = U_L(z)\, Z_L(z) = U_G(z)\, V(z)\, Z_L(z) \qquad (2.17)$$

For voiced sounds we can model u_G[n] as an impulse train convolved with g[n], the glottal pulse. Since g[n] is of finite length, its z-transform is an all-zero system.

Figure 2.10 Model of the glottal excitation for voiced sounds

The complete model for both voiced and unvoiced sounds is shown in Figure 2.11.


Figure 2.11 General discrete-time model of speech production

The excitation can be either an impulse train with period T and amplitude A_v driving a filter G(z), or random noise with amplitude A_n.

We can simplify the model in Figure 2.11 by grouping G(z), V(z) and Z_L(z) into H(z) for voiced sounds, and V(z) and Z_L(z) into H(z) for unvoiced sounds. The simplified model is shown in Figure 2.12, where we make explicit the fact that the filter changes over time.

This model is a decent approximation, but it fails on voiced fricatives, since those sounds contain both a periodic component and an aspirated component. In this case, a mixed excitation model can be applied, using for voiced sounds a sum of both an impulse train and colored noise.

The model in Figure 2.13 is appealing because the source is white (has a flat spectrum) and all the coloring is in the filter. Other source-filter decompositions attempt to model the source as the signal at the glottis, in which case the source is definitely not white. Since G(z) and Z_L(z) contain zeros, and V(z) can also contain zeros for nasals, H(z) is then no longer all-pole.
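The following sketch synthesizes the three excitation types of Figures 2.11-2.13 numerically (Python with NumPy/SciPy; the single made-up resonance standing in for the vocal-tract filter and all numerical values are assumptions, not taken from the thesis):

```python
import numpy as np
from scipy.signal import lfilter

fs = 8000
n_samples = 8000

# Voiced excitation: impulse train with a 100 Hz pitch (period T0 samples)
T0 = fs // 100
voiced = np.zeros(n_samples)
voiced[::T0] = 1.0

# Unvoiced excitation: white noise
unvoiced = np.random.randn(n_samples)

# A toy all-pole filter with a single resonance near 500 Hz
pole = 0.97 * np.exp(2j * np.pi * 500 / fs)
a = np.poly([pole, np.conj(pole)]).real    # denominator of H(z)
speech_v = lfilter([1.0], a, voiced)       # "voiced speech"
speech_u = lfilter([1.0], a, unvoiced)     # "unvoiced speech"

# Mixed excitation for voiced fricatives: impulse train plus noise
speech_m = lfilter([1.0], a, voiced + 0.3 * unvoiced)
```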


Figure 2.13 A mixed excitation source-filter model of speech

2.2.3 Linear predictive coding

A very powerful method for speech analysis is based on linear predictive coding (LPC). This method is widely used because it is a fast and simple, yet effective, way of estimating the main parameters of speech signals.

An all-pole filter with a sufficient number of poles is a good approximation for speech signals. Thus, we could model the filter H(z) of Figure 2.12 as

$$H(z) = \frac{1}{1 - \sum_{k=1}^{p} a_k z^{-k}}$$

Linear predictive coding gets its name from the fact that it predicts the current sample as a linear combination of its past p samples:

$$\hat{x}[n] = \sum_{k=1}^{p} a_k\, x[n-k]$$

The prediction error when using this approximation is

$$e[n] = x[n] - \hat{x}[n] = x[n] - \sum_{k=1}^{p} a_k\, x[n-k]$$


• The Orthogonality Principle

To estimate the predictor coefficients from a set of speech samples, we use the short-term analysis technique. Let us define x_m[n] as a segment of speech selected in the vicinity of sample m:

$$x_m[n] = x[m+n]$$

Given a signal x_m[n], we estimate its corresponding LPC coefficients as those that minimize the total prediction error E_m. Taking the derivative of Eq. (2.24) with respect to a_i and equating it to 0, we obtain the normal equations:

$$\sum_{k=1}^{p} a_k \left\langle x_m[n-k],\, x_m[n-i] \right\rangle = \left\langle x_m[n],\, x_m[n-i] \right\rangle, \qquad 1 \le i \le p$$


For convenience, we can define the correlation coefficients as $\phi_m[i,k] = \langle x_m[n-i],\, x_m[n-k]\rangle$, so that the normal equations read $\sum_{k=1}^{p} a_k\, \phi_m[i,k] = \phi_m[i,0]$.

Solving this set of p linear equations yields the p LPC coefficients that minimize the prediction error. With a_k satisfying Eq. (2.28), the total prediction error in Eq. (2.24) takes on the following value:

$$E_m = \phi_m[0,0] - \sum_{k=1}^{p} a_k\, \phi_m[0,k]$$

• Solution of the LPC Equations

There are three different algorithms for solving the LPC equations: the covariance method, the autocorrelation method and the lattice method. In this section, we present the autocorrelation method, in which the speech segment is windowed,

$$x_m[n] = x[m+n]\, w[n]$$


with w[n] being a window (such as a Hamming window) which is 0 outside the interval 0 ≤ n < N. With this assumption, the corresponding prediction error e_m[n] is non-zero over the interval 0 ≤ n < N + p, and therefore the total prediction error takes on the value

$$E_m = \sum_{n=0}^{N+p-1} e_m^2[n]$$

The matrix in Eq. (2.39) is symmetric and all the elements along each diagonal are identical. Such matrices are called Toeplitz. Durbin's recursion exploits this fact, resulting in a very efficient algorithm (for convenience, we omit the subscript m of the autocorrelation function).
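A sketch of the autocorrelation method with Durbin's recursion follows (an illustrative Python/NumPy implementation, not the code used in the thesis). It returns the inverse-filter coefficients [1, −a_1, …, −a_p] in the convention of the prediction equation above, together with the reflection coefficients k_i and the final prediction error:

```python
import numpy as np

def lpc_autocorrelation(x, p):
    """LPC by the autocorrelation method with Durbin's recursion.

    x : a windowed speech frame (e.g. Hamming-windowed, as in the text)
    p : prediction order
    Returns (A, k, E): the inverse-filter coefficients
    A = [1, -a_1, ..., -a_p] (so that e[n] = x[n] - sum_k a_k x[n-k]),
    the reflection coefficients k_i (each in (-1, 1)), and the final
    prediction error E.
    """
    N = len(x)
    # Autocorrelation R[0..p]; its Toeplitz structure is what the
    # recursion below exploits.
    R = np.array([np.dot(x[:N - i], x[i:]) for i in range(p + 1)])
    A = np.zeros(p + 1)
    A[0] = 1.0
    k = np.zeros(p)
    E = R[0]
    for i in range(1, p + 1):
        acc = R[i] + np.dot(A[1:i], R[i - 1:0:-1])
        k[i - 1] = -acc / E
        A[1:i] = A[1:i] + k[i - 1] * A[i - 1:0:-1]
        A[i] = k[i - 1]
        E *= 1.0 - k[i - 1] ** 2       # error shrinks at every order
    return A, k, E
```

Note that, as the text says, the recursion produces the predictors of every order below p along the way; the order-i coefficients are exactly the contents of A after iteration i.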


where the coefficients k_i, called reflection coefficients, are bounded between −1 and 1. In the process of computing the predictor coefficients of order p, the recursion finds the solution of the predictor coefficients for all orders less than p. Replacing R[j] by the normalized autocorrelation coefficients r[j], defined as

$$r[j] = \frac{R[j]}{R[0]}$$

results in identical LPC coefficients, and the recursion is more robust to problems with arithmetic precision. Likewise, the normalized prediction error at iteration i is defined by dividing Eq. (2.29) by R[0], which, using Eq. (2.36), gives

$$V^{(i)} = \frac{E^{(i)}}{R[0]} = \prod_{j=1}^{i}\left(1 - k_j^{2}\right)$$


Using Eq. (2.22), we can compute the prediction error signal, also called the excitation or residual signal. For unvoiced speech synthetically generated by white noise followed by an LPC filter, we expect the residual to be approximately white noise. In practice, this approximation is quite good, and replacement of the residual by white noise followed by the LPC filter typically results in no audible difference. For voiced speech synthetically generated by an impulse train followed by an LPC filter, we expect the residual to approximate an impulse train. In practice, this is not the case, because the all-pole assumption is not altogether valid; thus the residual, although it contains spikes, is far from an impulse train. Replacing the residual by an impulse train, followed by the LPC filter, results in speech that sounds somewhat robotic, partly because real speech is not perfectly periodic (it has a random component as well), and partly because the zeros are not modeled by the LPC filter.
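A short sketch of this residual analysis and re-synthesis experiment, reusing the lpc_autocorrelation function sketched above (the frame here is a random stand-in for real windowed speech):

```python
import numpy as np
from scipy.signal import lfilter

# Residual (excitation) by inverse filtering: e[n] = x[n] - sum_k a_k x[n-k],
# i.e. the speech frame passed through A(z) from the sketch above.
frame = np.random.randn(240) * np.hamming(240)   # stand-in for real speech
A, k, E = lpc_autocorrelation(frame, p=10)
residual = lfilter(A, [1.0], frame)

# Re-synthesis: replace the residual with white noise of the same power
# and run it back through the all-pole filter 1/A(z).
noise = np.sqrt(np.mean(residual ** 2)) * np.random.randn(len(frame))
resynth = lfilter([1.0], A, noise)
```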

How do we choose p? This is an important design question. Unvoiced speech has a higher error than voiced speech, because the LPC model is more accurate for voiced speech. In general, the normalized error decreases rapidly and then converges for p of around 12-14 for 8 kHz speech. If we use a large value of p, we are fitting the individual harmonics, so the LPC filter is modeling the source, and the separation between source and filter is not going to be as good. Moreover, the more coefficients we have to estimate, the larger the variance of their estimates, since the number of available samples stays the same. A rule of thumb is to use one complex pole per kHz, plus 2-4 poles to model the radiation and glottal effects.

For unvoiced speech, both the autocorrelation and the covariance methods provide similar results. For voiced speech, however, the covariance method can provide better estimates if the analysis window is shorter than the


local pitch period and the window only includes samples from the closed phase (when the vocal tract is closed at the glottis and the speech signal is due mainly to free resonances). This is called pitch-synchronous analysis and results in a lower prediction error, because the true excitation is close to zero during the whole analysis window. During the open phase, the trachea, the vocal folds and the vocal tract are acoustically coupled, and this coupling changes the free resonances. Additionally, the prediction error is higher for both the autocorrelation and the covariance methods if samples from the open phase are included in the analysis window, because the prediction during those instants is poor.

2.2.4 Cepstral processing

In this section we introduce the cepstrum as a homomorphic transformation that allows us to separate the source from the filter. We show that we can find a value N such that the cepstrum of the filter satisfies ĥ[n] ≈ 0 for n ≥ N, and the cepstrum of the excitation satisfies ê[n] ≈ 0 for n < N. With this assumption, we can approximately recover both e[n] and h[n] from x[n] by homomorphic filtering. In Figure 2.15, we show how to recover h[n] with a


low-quefrency (lowpass) lifter, where D is the cepstrum operator.

• The Real and Complex Cepstrum

The real cepstrum of a digital signal x[n] is defined as

$$c[n] = \frac{1}{2\pi} \int_{-\pi}^{\pi} \ln\left|X(e^{j\omega})\right| e^{j\omega n}\, d\omega \qquad (2.53)$$

and the complex cepstrum of x[n] is defined as

$$\hat{x}[n] = \frac{1}{2\pi} \int_{-\pi}^{\pi} \ln X(e^{j\omega})\, e^{j\omega n}\, d\omega \qquad (2.54)$$

One can see from Eqs. (2.53) and (2.54) that both the real and the complex cepstrum satisfy Eq. (2.50) and are thus homomorphic transformations. If the signal x[n] is real, both the real cepstrum c[n] and the complex cepstrum x̂[n] are also real signals. Therefore the term complex


cepstrum does not mean that it is a complex signal, but rather that the complex logarithm is used in its computation.

From here on, when we refer to the cepstrum without qualifiers, we are referring to the real cepstrum, since it is the most widely used in speech technology. The cepstrum was invented by Bogert, and its term was coined by reversing the first syllable of the word spectrum, given that it is obtained by taking the inverse Fourier transform of the log-spectrum. Similarly, the term quefrency was defined to represent the independent variable n in c[n]. The quefrency has the dimension of time.

• Cepstrum of Speech Signals

We can compute the cepstrum of a speech segment by windowing the signal with a window of length N and applying the definition in Eq. (2.54) directly, using the DFT as follows:

$$\hat{x}_N[n] = \frac{1}{N} \sum_{k=0}^{N-1} \ln X[k]\; e^{\,j 2\pi k n / N}$$

Sampling the spectrum at N points makes x̂_N[n] an aliased version of x̂[n]. This aliasing introduces errors in the estimation, which can be reduced by choosing a large value for N. Computation of the complex cepstrum requires


computing the complex logarithm and, in turn, the phase. However, given the principal value of the phase θ_p[k], there are infinitely many possible values for θ[k]. If x[n] is real, arg[X(e^{jω})] is an odd and continuous function. Thus we can do phase unwrapping by choosing the multiple of 2π that guarantees that θ[k] is a smooth function, i.e., by forcing the difference between adjacent values to be small:

$$\left|\theta[k] - \theta[k-1]\right| < \pi$$

For unvoiced speech, the unwrapped phase is random, and therefore only the real cepstrum has meaning. In practical situations, even voiced speech has some frequencies at which noise dominates (typically very low and high frequencies), which results in a phase θ[k] that changes drastically from frame to frame. Because of this, the complex cepstrum in Eq. (2.54) is rarely used for real speech signals; instead, the real cepstrum is used much more often. Similarly, it can be shown that the real cepstrum computed through the DFT, c_N[n], is an aliased version of c[n]; this aliasing can again be reduced by choosing a large value for N.
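A minimal sketch of the real-cepstrum computation through the DFT (assumed Python/NumPy; the liftering cutoff of 30 quefrency samples is an arbitrary illustration):

```python
import numpy as np

def real_cepstrum(frame, n_fft=512):
    """Real cepstrum of a windowed frame via the DFT.

    The aliasing discussed above is reduced by choosing n_fft large
    relative to the frame length. (The complex cepstrum would need the
    unwrapped phase as well, e.g. via np.unwrap.)
    """
    spectrum = np.fft.fft(frame, n=n_fft)
    log_mag = np.log(np.abs(spectrum) + 1e-12)   # real logarithm only
    return np.fft.ifft(log_mag).real

frame = np.random.randn(240) * np.hamming(240)
c = real_cepstrum(frame)

# Crude homomorphic liftering: low quefrencies ~ filter, high ~ source
h_hat = c.copy()
h_hat[30:-30] = 0.0
```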

2.2.5 Perceptually-motivated representations

In this section we describe some aspects of human perception and a method motivated by the behavior of the human auditory system: Mel-Frequency Cepstrum Coefficients (MFCC). This method has been


used successfully in speech recognition. First we present several nonlinear frequency scales that have been used in such representations.

• The Bilinear Transform

The transformation

$$\tilde{z}^{-1} = \frac{z^{-1} - \alpha}{1 - \alpha z^{-1}}$$

for 0 < α < 1 belongs to the class of bilinear transforms. It is a mapping in the complex plane that maps the unit circle onto itself. The frequency transformation is obtained by making the substitutions z = e^{jω} and z̃ = e^{jω̃}, which yields

$$\tilde{\omega} = \omega + 2\arctan\!\left(\frac{\alpha \sin\omega}{1 - \alpha\cos\omega}\right)$$

Oppenheim showed that the advantage of this transformation is that it can be used to transform a time sequence in the linear frequency into another time sequence in the warped frequency. This bilinear transform has been applied successfully to cepstral and autocorrelation coefficients. Both sets of coefficients are causal. The input is the time-reversed cepstrum sequence, and the output can be obtained by sampling the outputs of the filters at time n = 0. The filters used for w[m], m ≥ 2, are the same. Note that, for a finite-length cepstrum, an infinite-length warped cepstrum results.
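A numerical sketch of the induced frequency warping (Python/NumPy; the value α = 0.45 is an assumption, often quoted as giving a roughly mel-like warping at 16 kHz):

```python
import numpy as np

def warped_frequency(omega, alpha):
    """Frequency mapping induced by the all-pass bilinear transform
    (z^-1 - alpha) / (1 - alpha * z^-1), for 0 < alpha < 1.

    The unit circle maps onto itself, so a real frequency omega maps to
    another real frequency; alpha controls how strongly the low
    frequencies are stretched.
    """
    z_inv = np.exp(-1j * omega)
    allpass = (z_inv - alpha) / (1.0 - alpha * z_inv)
    return -np.angle(allpass)   # equals omega + 2*arctan(...) from the text

omega = np.linspace(0, np.pi, 256)
omega_warped = warped_frequency(omega, alpha=0.45)
```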


For a finite number of cepstral coefficients, the bilinear transform in Figure 2.16 results in an infinite number of warped cepstral coefficients. Since truncation is usually done in practice, the bilinear transform is equivalent to a matrix multiplication, where the matrix is a function of the warping parameter α.

The Mel-Frequency Cepstrum Coefficients (MFCC) representation is defined as the real cepstrum of a windowed short-time signal derived from the FFT of that signal. The difference from the real cepstrum is that a nonlinear frequency scale is used, which approximates the behavior of the auditory system. The MFCC representation is beneficial for speech recognition. Given the DFT of the input signal
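The standard MFCC pipeline being introduced here can be sketched as follows (Python/NumPy; the mel formula, the filter count and the cepstral order are common choices rather than values taken from this chapter):

```python
import numpy as np

def mel(f):
    """Hz -> mel, using the common 2595*log10(1 + f/700) formula (an
    assumption; the text does not reproduce its exact scale)."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_inv(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(frame, fs, n_filters=24, n_ceps=13, n_fft=512):
    """Minimal MFCC sketch: FFT -> triangular mel filterbank (Figure 2.17)
    -> log energies -> DCT (a real cepstrum on the warped scale)."""
    power = np.abs(np.fft.rfft(frame, n=n_fft)) ** 2
    # Filterbank edges equally spaced on the mel scale
    edges = mel_inv(np.linspace(mel(0.0), mel(fs / 2.0), n_filters + 2))
    bins = np.floor((n_fft + 1) * edges / fs).astype(int)
    fbank = np.zeros((n_filters, len(power)))
    for i in range(n_filters):
        lo, mid, hi = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, lo:mid] = (np.arange(lo, mid) - lo) / max(mid - lo, 1)
        fbank[i, mid:hi] = (hi - np.arange(mid, hi)) / max(hi - mid, 1)
    log_e = np.log(fbank @ power + 1e-12)
    # Type-II DCT implemented directly to avoid extra dependencies
    k = np.arange(n_ceps)[:, None]
    n_ = np.arange(n_filters)[None, :]
    dct = np.cos(np.pi * k * (2 * n_ + 1) / (2 * n_filters))
    return dct @ log_e
```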
