
MÔ HÌNH ĐẶC TÍNH ÂM HỌC ĐỘNG CHO NHẬN DẠNG TIẾNG NÓI TIẾNG VIỆT VÀ ỨNG DỤNG CHO VIỆC PHÂN TÍCH SỰ CHUYỂN TIẾP NGUYÊN ÂM – NGUYÊN ÂM

MASTER THESIS OF SCIENCE COMPUTER SCIENCE

Hanoi – 2018


MINISTRY OF EDUCATION AND TRAINING
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY


Nguyen Hang Phuong

MODELING DYNAMIC ACOUSTIC FEATURE OF SPEECH FOR VIETNAMESE SPEECH RECOGNITION AND APPLICATION FOR ANALYZING VOWEL-TO-VOWEL TRANSITIONS

MÔ HÌNH ĐẶC TÍNH ÂM HỌC ĐỘNG CHO NHẬN DẠNG TIẾNG NÓI TIẾNG VIỆT VÀ ỨNG DỤNG CHO VIỆC PHÂN TÍCH SỰ CHUYỂN TIẾP NGUYÊN ÂM – NGUYÊN ÂM

Specialty: Computer Science
International Research Institute MICA

MASTER THESIS OF SCIENCE COMPUTER SCIENCE

SUPERVISORS:

Prof. Dr. Eric Castelli

Dr. Nguyen Viet Son

Hanoi – 2018


DECLARATION OF AUTHORSHIP

I, NGUYEN Hang Phuong, declare that this thesis titled, "Modeling dynamic acoustic feature of speech for Vietnamese speech recognition and application for analyzing Vowel-to-Vowel transitions" and the work presented in it are my own. I confirm that:

- This work was done wholly or mainly while in candidature for a research degree at this University.

- Where any part of this thesis has previously been submitted for a degree or any other qualification at this University or any other institution, this has been clearly stated.

- Where I have consulted the published work of others, this is always clearly attributed.

- Where I have quoted from the work of others, the source is always given. With the exception of such quotations, this thesis is entirely my own work.

- I have acknowledged all main sources of help.

- Where the thesis is based on work done by myself jointly with others, I have made clear exactly what was done by others and what I have contributed myself.

Signed:

Date:


ACKNOWLEDGEMENTS

It is an honor for me to write these words of thanks to those who have been supporting, guiding and inspiring me from the moment I started my work at the International Research Institute MICA until now.

I owe my deepest gratitude to my supervisors, Prof. Eric Castelli and Dr. Nguyen Viet Son. Their expertise, understanding and generous guidance made it possible for me to work on a topic that was new to me. They have offered their support in many ways to help me find solutions in my work. It is a pleasure to work with them.

Special thanks to Dr. Mac Dang Khoa, Dr. Do Thi Ngoc Diep, Dr. Nguyen Cong Phuong and all members of the Speech Communication Department for their guidance, which helped me a lot in learning how to study and do research in the right way, and also for their valuable advice on my work.

Finally, this thesis would not have been possible without the encouragement of my family and friends. Their words gave me the strength to overcome all the embarrassment, discouragement and other difficulties. Thanks for everything that helped me to reach this day.

Hanoi, 23/03/2018
Nguyen Hang Phuong


TABLE OF CONTENTS

DECLARATION OF AUTHORSHIP 3

ACKNOWLEDGEMENTS 4

TABLE OF CONTENTS 5

LIST OF FIGURES 7

LIST OF TABLES 9

LIST OF ABBREVIATIONS 10

CHAPTER 1: INTRODUCTION 11

1.1 Overview about Automatic Speech Recognition 11

1.2 The objective of thesis 12

1.3 Thesis outline 12

CHAPTER 2: ACOUSTIC FEATURES FOR ASR SYSTEM 13

2.1 The goals of acoustic feature extraction in an ASR system 13

2.2 Speech signal characterization 14

2.2.1 Speech is non-stationary 14

2.2.2 Static and dynamic characterization of speech signal 19

2.2.2.1 Static characterization of speech signal 19

2.2.2.2 Dynamic characterization of speech signal 19

2.3 Static speech feature 21

2.3.1 An overview about MFCCs feature 21

2.3.2 Limitation of MFCCs 22

2.4 State of the art of modeling dynamic acoustic speech feature 23

2.5 Conclusion of chapter 2 23

CHAPTER 3: IMPROVEMENT ON AUTOMATIC MODELING DYNAMIC ACOUSTIC SPEECH FEATURE 25

3.1 Improvement on computing Spectral Sub-band Centroid Frequency (SSCF) 25

3.1.1 SSCF features generated from the original definition 25

3.1.2 Influence of subband filters on the SSCF features 28

3.1.3 New proposal design of subband filters 40

3.1.4 Analysis of SSCFs when applying the new design of subband filters 42

3.1.4.1 /ai/ and /ae/ transition 43


3.1.4.2 /ao/ and /au/ transition 46

3.1.4.3 /ia/ and /ea/ transition 48

3.1.4.4 /oa/ and /ua/ transition 51

3.1.4.5 /iu/ transition 53

3.1.4.6 /ui/ transition 54

3.2 New proposal approach to automatic SSCF angles computation 55

3.2.1 Definition and Automatic SSCF angles calculation 55

3.2.2 Analysis of SSCF angles in Vowel-Vowel transitions 57

3.2.2.1 /ai/ and /ae/ transition 58

3.2.2.2 /ao/ and /au/ transition 61

3.2.2.3 /ia/ and /ea/ transition 63

3.2.2.4 /oa/ and /ua/ transition 66

3.2.2.5 /iu/ transition 69

3.2.2.6 /ui/ transition 71

3.3 Conclusion of chapter 3 73

CHAPTER 4: APPLY SSCF ANGLES TO A SPEECH RECOGNITION TOOLKIT, KALDI 74

4.1 An overview about Kaldi – an open source for speech recognition 74

4.2 Balanced speaker experiments on Kaldi 76

4.2.1 Using MFCC features 76

4.2.2 Using SSCF angles 77

4.3 Unbalanced speaker experiments on Kaldi using SSCF angles 79

4.4 Conclusion of chapter 4 80

CHAPTER 5: CONCLUSION AND FUTURE WORK 82

5.1 Conclusion 82

5.2 Future work 83

PUBLICATIONS 84

REFERENCES 85

APPENDIX 87


LIST OF FIGURES

Figure 2-1: An outline of a typical speech recognition system [3] 13

Figure 2-2: a) Single-tone sine wave of 10 Hz sampled using a sampling frequency of 1000 Hz, b) Magnitude spectrum of single-tone sine wave respectively [9] 15

Figure 2-3: a) Multi-tone sine wave of 10, 50 and 100 Hz sampled using a sampling frequency of 1000 Hz, b) Magnitude spectrum of corresponding multi-tone sine wave [9] 16

Figure 2-4: a) Non-stationary multi-tone sine wave of 10, 50 and 100 Hz sampled using a sampling frequency of 1000 Hz, b) Magnitude spectra of corresponding non-stationary multi-tone sine wave [9] 17

Figure 2-5: a) Speech signal for the Hindi word "sakshaat", b) Corresponding spectra of different segments of the Hindi speech signal [9] 18

Figure 2-6: Flow chart for MFCC computation [21] 22

Figure 3-1: The algorithm for extracting SSCFs 25

Figure 3-2: The shape of six-triangle overlapped subband filters for computing SSCF 26

Figure 3-3: SSCF parameter extraction from a speech signal, frame by frame [13] 26

Figure 3-4: Comparison results between formants and SSCFs in /ai/ when applying six subband filters: a) F1, b) F2, c) M1 and d) M2 27

Figure 3-5: The shape of five-triangle overlapped subband filters for computing SSCF 28

Figure 3-6: Comparison results between formants and SSCFs in /ai/ when applying five overlapping subband filters: a) F1, b) F2, c) M1 and d) M2 29

Figure 3-7: The method for evaluating the effect of the number of subband filters on SSCF results 30

Figure 3-8: Comparison when using 5 or 6 Triangular Subband Filters in /ai/ transition: a) F1, b) F2, c) M1 and d) M2 32

Figure 3-9: Comparison when using 5 or 6 Triangular Subband Filters in /ae/ transition: a) F1, b) F2, c) M1 and d) M2 34

Figure 3-10: Comparison when using 5 or 6 Triangular Subband Filters in /ao/ transition: a) F1, b) F2, c) M1 and d) M2 36

Figure 3-11: Comparison when using 5 or 6 Triangular Subband Filters in /au/ transition: a) the first female (F1), b) the second female (F2), c) the first male (M1), d) the second male (M2) 38

Figure 3-12: [aV] trajectories for native French speakers at normal rate: a) F1-F2 plane from publication [25]; SSCF1-SSCF2 plane from measurement results when b) using 6 triangular filters, c) using 5 triangular filters 39

Figure 3-13: The definition of subband filters with equal length in mel-scale: a) five-triangle subband filters, b) six-triangle subband filters 41

Figure 3-14: The shape of the newly proposed six subband filters 42

Figure 3-15: a) The trajectories in Vowel-to-Vowel French transition obtained with simulation [27], [28], French vocalic triangle in SSCF1-SSCF2 plane: b) For two native females, c) For two native males 43


Figure 3-16: Comparison between SSCFs and Formants using the proposed subband filters in /ai/ transition: a) F1, b) F2, c) M1 and d) M2 44

Figure 3-17: Comparison between SSCFs and Formants using the proposed subband filters in /ae/ transition: a) F1, b) F2, c) M1 and d) M2 45

Figure 3-18: Comparison between SSCFs and Formants using the proposed subband filters in /ao/ transition: a) F1, b) F2, c) M1 and d) M2 46

Figure 3-19: Comparison between SSCFs and Formants using the proposed subband filters in /au/ transition: a) F1, b) F2, c) M1 and d) M2 47

Figure 3-20: Comparison between SSCFs and Formants using the proposed subband filters in /ia/ transition: a) F1, b) F2, c) M1 and d) M2 49

Figure 3-21: Comparison between SSCFs and Formants using the proposed subband filters in /ea/ transition: a) F1, b) F2, c) M1 and d) M2 50

Figure 3-22: Comparison between SSCFs and Formants using the proposed subband filters in /oa/ transition: a) F1, b) F2, c) M1 and d) M2 51

Figure 3-23: Comparison between SSCFs and Formants using the proposed subband filters in /ua/ transition: a) F1, b) F2, c) M1 and d) M2 52

Figure 3-24: Comparison between SSCFs and Formants using the proposed subband filters in /iu/ transition: a) F1, b) F2, c) M1 and d) M2 53

Figure 3-25: Comparison between SSCFs and Formants using the proposed subband filters in /ui/ transition: a) F1, b) F2, c) M1 and d) M2 54

Figure 3-26: SSCF angles12 in SSCF1/SSCF2 plane [13] 56

Figure 3-27: SSCF angles calculated from the proposed definition in /ai/ transition: a) F1, b) F2, c) M1 and d) M2 59

Figure 3-28: SSCF angles calculated from the proposed definition in /ae/ transition: a) F1, b) F2, c) M1 and d) M2 60

Figure 3-29: SSCF angles calculated from the proposed definition in /ao/ transition: a) F1, b) F2, c) M1 and d) M2 61

Figure 3-30: SSCF angles calculated from the proposed definition in /au/ transition: a) F1, b) F2, c) M1 and d) M2 62

Figure 3-31: SSCF angles calculated from the proposed definition in /ia/ transition: a) F1, b) F2, c) M1 and d) M2 64

Figure 3-32: SSCF angles calculated from the proposed definition in /ea/ transition: a) F1, b) F2, c) M1 and d) M2 65

Figure 3-33: SSCF angles calculated from the proposed definition in /oa/ transition: a) F1, b) F2, c) M1 and d) M2 67

Figure 3-34: SSCF angles calculated from the proposed definition in /ua/ transition: a) F1, b) F2, c) M1 and d) M2 68

Figure 3-35: SSCF angles calculated from the proposed definition in /iu/ transition: a) F1, b) F2, c) M1 and d) M2 70

Figure 3-36: SSCF angles calculated from the proposed definition in /ui/ transition: a) F1, b) F2, c) M1 and d) M2 72

Figure 4-1: A schematic overview of the Kaldi toolkit [30] 75


LIST OF TABLES

Table 2-1: Four types of signal as elaborated in [8] 14

Table 3-1: The definition of SSCF angles 57

Table 4-1: A full description of speech database 76

Table 4-2: Syllable Error Rate (SER%) using MFCCs and their derivatives 77

Table 4-3: Syllable Error Rate (SER) using SSCF angles and their derivatives 78

Table 4-4: Syllable error rate (%) in Vietnamese ASR using SSCF angles and their derivatives in the unbalanced speaker experiment 80


LIST OF ABBREVIATIONS

ASR: Automatic Speech Recognition
MFCC: Mel-Frequency Cepstral Coefficients
SSCF: Spectral Subband Centroid Features


CHAPTER 1: INTRODUCTION

1.1 Overview about Automatic Speech Recognition

Among the tasks for which machines may simulate human behavior, automatic speech recognition (ASR) has been foremost since the advent of computers. The logical partner of ASR, automatic speech synthesis, existed before practical computing machines, although the quality of synthetic speech has only recently become reasonable. In earlier times, devices were built that approximated the acoustics of human vocal tracts (VTs), as the basic mechanisms of speech production were evident to early scientists, using models based upon musical instruments. A device to understand speech, however, needed a calculating machine capable of making complex decisions and, practically, one that could function as rapidly as humans. As a result, ASR has grown roughly in proportion to other areas of pattern recognition (PR), in large part based on the power of computers to capture a relevant signal and transform it into pertinent information, i.e., recognizing a pattern in the (speech) signal [1].

The most common objective of ASR is a textual translation of the speech signal, i.e., the text corresponding to what one has said. Other useful outputs include the language of the speech, the speaker's emotional state, and the speaker's identity [1]. A very practical use for ASR is as part (along with natural language understanding and automatic speech synthesis) of a human-machine dialogue, whereby a user can interact efficiently with a machine/robot, e.g., telephony [1].

So, what should an ASR system look for? The peaks of the speech signal's spectral envelope (especially the center frequencies of F1, F2 and F3) seem to be very pertinent features, since the various VT shapes used in sonorants cause a reliable dispersion of phonemes in F1-F2-F3 space. In addition, the human auditory system seems well tuned to perceive variations in such spectral peak positions. Of less relevance appear to be the formant bandwidths; these are less readily controlled by speakers and less easily distinguished by listeners. (Similar comments hold for the general fall-off spectral slope: sonorant spectra generally decline at an approximate rate of −6 dB/octave, owing mostly to the low-pass nature of glottal excitation, and variations in such slope generally evoke little perceptual notice.) Of clear importance to speech perception (and hence to ASR) is the general intensity of speech. Sonorants are much stronger than obstruents, and, within these two classes of sounds, intensity also varies reliably; e.g., /a/ is stronger than /i/, and /s/ than /f/. Such distinctions can be achieved based on spectral peak position alone, without the cue of intensity, but intensity (although often redundant) is easily measurable and commonly used by listeners. Nonetheless, ASR often uses intensity less than human listeners do, as the level of a speech signal varies greatly with recording conditions. A common form of signal normalization in ASR is to await the end of an utterance and then subtract the average value from each parameter in the time sequence. This "differential" analysis focuses ASR's attention on frame-to-frame changes in speech, rather than on absolute values. Such differencing can still allow ASR to notice relatively loud sounds, in terms of a series of frame-to-frame increases in intensity. However, most ASR systems use very localized feature measures, owing to the first-order Markov models employed [1].
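As a minimal illustration of the utterance-level normalization just described, the following numpy sketch subtracts the per-dimension mean over all frames of an utterance; the array shape and names are purely illustrative and are not taken from any particular toolkit.

```python
import numpy as np

def utterance_mean_normalize(features):
    """Subtract the per-dimension mean computed over the whole utterance,
    so downstream models see frame-to-frame changes rather than absolute levels."""
    return features - features.mean(axis=0, keepdims=True)

# toy usage: 300 frames x 13 coefficients, with a constant offset standing in
# for a recording-condition bias
features = np.random.randn(300, 13) + 5.0
normalized = utterance_mean_normalize(features)
print(normalized.mean(axis=0))   # approximately zero in every dimension
```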

1.2 The objective of thesis

Based on these analyses, this thesis pursues three main ideas, which focus on a new approach to building an ASR system.

Firstly, this thesis proposes an improvement in computing Spectral Subband Centroid Features (SSCFs). The target of the improved method is to obtain SSCFs that are similar to formant frequencies.

Secondly, this thesis proposes an automatic SSCF angle computation using relative angles. The results of the proposed dynamic feature are analyzed in detail on Vowel-to-Vowel transitions. The computation of the dynamic feature is a fully automatic process. The aim of modeling the dynamic feature is to overcome the limitation of gender dependence.

Finally, this thesis uses the dynamic feature modeling approach on a continuous speech database and applies it in an ASR system built with the Kaldi toolkit.

1.3 Thesis outline

This thesis work has been organized as follows:

Chapter 2 gives an overview of acoustic features for an ASR system and presents the two basic natures of speech: its non-stationary property, and its static and dynamic characterizations. After that, Chapter 2 analyzes some limitations of the MFCC feature and the state of the art of modeling dynamic acoustic speech features.

Chapter 3 presents the automatic modeling of dynamic acoustic speech features, with two main targets. The first is improving the "pseudo-formant" parameters, the SSCFs: with the original definition, the six SSCF parameters are not suitable for the aim of this thesis when compared with formants, so Chapter 3 studies the influence of the subband filters on the SSCFs and then proposes a new design of six subband filters. The second aim is proposing a new way to automatically model the dynamic acoustic speech features, the SSCF angles. The results for the SSCFs and SSCF angles are evaluated on Vowel-to-Vowel transitions.

The overview of the Kaldi toolkit and some ASR experiments in Kaldi are given in Chapter 4. A Vietnamese continuous speech database is used for these ASR experiments, which are classified into two groups: the first group uses a balanced speaker database, and the second one uses unbalanced speaker data. The speech features are MFCCs and the proposed SSCF angles.


CHAPTER 2: ACOUSTIC FEATURES FOR ASR SYSTEM

2.1 The goals of acoustic feature extraction in an ASR system

Automatic speech recognition has been an active research area for over five decades. It has always been considered an important bridge in fostering better human–human and human–machine communication [2]. In its basic definition, an ASR system converts speech from a recorded audio signal to text [3].

Figure 2-1: An outline of a typical speech recognition system [3]

Following this outline, feature extraction is the second step of an ASR system, after the preprocessing step. This component should acquire descriptive features from the windowed and enhanced speech signal to enable a classification of speech sounds. In general, feature extraction answers the question of how to represent the speech waveform mathematically. Feature extraction is one of the most important parts of an ASR system because the raw speech signal contains information besides the linguistic message and has a high dimensionality. Both characteristics of the raw speech signal would make the classification of sounds unfeasible and result in a high word error rate. Therefore, the feature extraction algorithm derives a characteristic feature vector with a lower dimension than the original signal, which is used for the classification of sounds [1], [4].

A feature vector of speech should emphasize the information that is important for the specific task and suppress all other information. Since the goal of automatic speech recognition is to transcribe the linguistic information from an input speech waveform, this information needs to be emphasized. The speaker-dependent characteristics and the characteristics of the environment and the recording equipment should be suppressed, because they do not carry any linguistic information. Including this non-linguistic information would introduce additional variability, which could have a negative impact on the separability of the phone classes. Furthermore, the feature extraction should decrease the dimensionality of the data to reduce the computation time and the number of training samples [4].

Globally, there are some desirable characteristics for the features obtained after the feature extraction step [5], [6], [7]:

- Capturing essential information for sound and word identification

- Reducing the size of the feature vector and compressing information into a compact form

- Factoring out information that's not relevant to recognition, such as the vocal-tract length of speakers or channel characteristics

- Suppressing the speaker-dependent characteristics and the recording-condition characteristics

- Features can be well modeled by known distributions (for instance Gaussian models)

- Features could be widely used in ASR

2.2 Speech signal characterization

2.2.1 Speech is non-stationary

A signal is an observation: a recording of something that has happened, or a recording of a series of events resulting from some process. Signals are generated by systems and carry information about the systems from which they originate. To extract the information from signals and reveal the underlying dynamics that correspond to them, it is necessary to use proper signal processing techniques. These techniques usually transform a time-domain signal into another domain, the purpose of the transformation being to extract the characteristic information embedded within the time series [8]. Depending on the techniques, in most cases the signals can be classified into four categories. A general example of these four types is shown in Table 2-1.

Table 2-1: Four types of signal as elaborated in [8]


These four signal types are based on whether the signal is stationary or non-stationary. So, how can the difference between stationary and non-stationary characterization be recognized? The answer is whether or not the frequency or spectral content of the signal changes with respect to time. This is a very important criterion for classifying signals, and the following arguments illustrate it.

First of all, let us observe a typical single-tone sine wave; it is Type I. Mathematically, it can be represented as given in Eq. 2.1 [9].

Figure 2-2: a) Single-tone sine wave of 10 Hz sampled using a sampling frequency of 1000 Hz, b) Magnitude spectrum of single-tone sine wave respectively [9]


As observed in Figure 2-2, the sine wave amplitude varies in time, but its frequency content has only one frequency and the corresponding magnitude spectrum does not change. Secondly, let us consider the frequency and magnitude components of a multi-tone sine wave. That sine wave is Type II and it contains many frequency components. Mathematically, it can be represented as given in Eq. 2.2 [9].

Figure 2-3: a) Multi-tone sine wave of 10, 50 and 100 Hz sampled using a sampling frequency of 1000 Hz, b) Magnitude spectrum of corresponding multi-tone sine wave [9]


An instance is a multi-tone sine wave made of three frequency components, 10 Hz, 50 Hz and 100 Hz, respectively; the graph is shown in Figure 2-3a. Even though this signal looks complicated compared to the single-tone sine wave shown in Figure 2-2, it is still a stationary signal because the frequency contents do not change with time, as shown in Figure 2-3b. If none of the frequency components change with time, then regardless of the number of frequency components present, the multi-tone sine wave is still stationary [9].


Let the frequency component f1 Hz be present in the first interval, then two components f1 and f2 Hz in the second interval, three components f1, f2 and f3 Hz in the third interval, and finally f1 Hz again in the fourth interval. A mathematical formulation for the just-described signal can be written as given in Eq. 2.3 [9]. For instance, consider a non-stationary multi-tone sine wave of duration 1 sec which has a 10 Hz component for 0-250 msec, 10 and 50 Hz components for 250-500 msec, 10, 50 and 100 Hz components for 500-750 msec, and only a 10 Hz component for 750-1000 msec, as shown in Figure 2-4. Following the equation and the chart, the frequency components change from one interval to the other. Thus, such a signal satisfies the definition of non-stationarity.
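The sketch below builds this piecewise signal (the concrete instance of Eq. 2.3 described above) with numpy and prints, for each 250 msec interval, the spectral magnitude at the three nominal frequencies; the interval boundaries and the 1000 Hz sampling rate follow the description above, everything else is illustrative.

```python
import numpy as np

fs = 1000                                  # sampling frequency (Hz)
t = np.arange(0, 1.0, 1.0 / fs)
segments = [(0.00, 0.25, [10]),            # interval 1: 10 Hz only
            (0.25, 0.50, [10, 50]),        # interval 2: 10 and 50 Hz
            (0.50, 0.75, [10, 50, 100]),   # interval 3: 10, 50 and 100 Hz
            (0.75, 1.00, [10])]            # interval 4: 10 Hz only

x = np.zeros_like(t)
for start, stop, freqs in segments:
    idx = (t >= start) & (t < stop)
    x[idx] = sum(np.sin(2 * np.pi * f * t[idx]) for f in freqs)

# the magnitude spectrum of each interval shows different frequency content,
# which is exactly what makes the whole signal non-stationary
for start, stop, freqs in segments:
    idx = (t >= start) & (t < stop)
    spectrum = np.abs(np.fft.rfft(x[idx], n=fs))   # zero-padded: 1 Hz per bin
    magnitudes = {f_hz: round(float(spectrum[f_hz]), 1) for f_hz in (10, 50, 100)}
    print(f"{start:.2f}-{stop:.2f} s:", magnitudes)
```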

Figure 2-5: a) Speech signal for the Hindi word "sakshaat", b) Corresponding spectra of different segments of the Hindi speech signal [9]


Given that the frequency characteristics can change over time, a question arises: "Is the speech signal stationary or non-stationary?" To answer this question in the simplest way, consider an example of a speech signal for the Hindi word "sakshaat". The speech signal in the time domain and some corresponding spectra are shown in Figure 2-5. Comparing this figure with Figure 2-3 and Figure 2-4, it can be seen that the speech signal is a much more complicated non-stationary signal than the descriptions above. Firstly, there may not be just one, two or three components, but many more in a given interval of time. Secondly, the interval itself is very short, as short as about 10-30 msec, as against the 250 msec considered above. To summarize, the speech signal has many frequency components and these components change continuously with time. Furthermore, the fundamental frequency and formant frequencies of the speech signal also vary in time [10], [11]. In conclusion, the speech signal fully exhibits the nature of a non-stationary signal.

2.2.2 Static and dynamic characterization of speech signal

2.2.2.1 Static characterization of speech signal

From the 19th century up to now, most speech scientists have portrayed the speech signal as a sequence of static states interleaved with transitional elements reflecting the continuous nature of vocal production. After all, there must be static, stable elements internally if listeners can perceive and label individual phonemes in the speech stream [12]. While these discrete representations (static targets reached during production and recovered during perception) may describe, at best, clearly pronounced "hyper-articulated" speech in which departures from the canonical are rare, they badly fail to characterize spoken language where such departures constitute the norm. A good example of the limitations of phonemic representation is an analysis of 45 minutes of spontaneous conversational speech in which 73 different forms of the word "and" were seen, and yet all of them were unambiguously identified by listeners [13].

In 1952, Gordon E. Peterson and Harold L. Barney published their study "Control Methods Used in a Study of the Vowels". This is a landmark paper in that vowels are generally characterized by the first two or three formant frequencies, each of which can be represented in the acoustic space (F1-F2 plane) by a dot. An early acoustic phonetic study of spectrographic data in terms of manner and place features and of the temporal distribution of information-bearing elements appeared in the work of Fant. This specification is static [13]. Moon and Lindblom considered that for each language and each speaker, vowels can be specified in terms of underlying 'targets' corresponding to the context- and duration-independent values of the formants, as obtained by fitting 'decaying exponentials' to the data points. The point in focus here is that this specification is static and, significantly, may be taken to imply that the perceptual representation corresponds to the target values [13].

2.2.2.2 Dynamic characterization of speech signal

The dynamic nature of speech shows itself in two ways: the first is the production dynamics of speech, and the second is the perceptual dynamics of speech.


Firstly, dynamic specification is indicated in the production of speech and sign. Sensory systems prefer time-varying over static stimuli. An example of this fact is provided by the dynamic spectro-temporal changes of speech signals, which are known to play a key role in speech perception. To some investigators such observations provide support for adopting the gesture as the basic entity of speech. An alleged advantage of such a dynamically defined unit, over the more traditional, static and abstract phoneme or segment, is that it can readily be observed in phonetic records. However, as has been thoroughly documented throughout the last fifty years, articulatory and acoustic measurements are ubiquitously context-dependent. That makes the gesture, defined as an observable, problematic as a primitive of phonetic theory. An analysis of articulatory and sign movement dynamics is presented in terms of a traditional model based on timeless spatial specifications (targets, via points) plus smoothing (as determined by the dynamics of speech effectors) [12]. A development along these lines is the "dynamic specification" approach proposed by Winifred Strange. It is based on a series of experiments demonstrating that listeners are able to identify vowels with high accuracy although the center portions of CVC stimuli have been removed, leaving only the first three and the last four periods of the vowel segment. In other words, vowel perception is possible also in 'silent-center' syllables, that is, syllables that lack information on the alleged 'target' but include an initial stop plosion and surrounding formant transitions. Strange takes her findings to imply that vowel identification in CVC sequences is based on more than just the information contained within a single spectral slice sampled near the midpoint "target" region of the acoustic vowel segment. Rather, the relevant information is distributed across the entire vowel and it includes formant frequency time variations [12].

Secondly, dynamic specification is indicated in the perception of speech. In order for verbal communication to occur, the perceptual mechanisms responsible for decoding the speech signal must also take into consideration its dynamically changing nature [12]. Perception data were used to verify production dynamics and, as a result, showed perceptual consequences of the dynamic processes of speech production. The consonant-vowel (CV) and vowel-consonant (VC) transitions do produce envelope changes in separate frequency bands; they signal changes in the vocal tract resonance pattern and thus generate rapid (20- to 50-ms) formant glides: frequency-modulation (FM) sweeps whose percept, in the second- and third-formant (F2 and F3) ranges, approaches that of sinusoidal glides. These glides represent important cues for consonant and vowel identification (see Carré's chapter). However, although these increasing or decreasing monotonic FMs could be seen as dynamic changes in the frequency domain, they also represent volley-like short-duration amplitude envelope increases across frequency channels, i.e., patterns of AM. These volleys can be simulated in a manner analogous to two successive light flashes giving the percept of a motion between them (called the phi phenomenon). The identity of a consonant can be established from a sparse spectral representation of a transition as long as its time course between the endpoints is similar to that of a transition in real speech. It is as if the auditory system does not care about the details of the spectral profile and performs a running weighted averaging, or spectral center-of-gravity (c-o-g) computation, across the active frequency channels, no matter how sparsely represented they are. The paucity of the necessary spectral information (i.e., the number of channels needed) showed that intelligibility can be achieved through spectral "slits", and the absence of sufficient energy to create a formant is interpreted by the listener as a spectral zero (that is, a nasality cue) following an instantaneous transition.

2.3 Static speech feature

Until now, many different methods have been developed to extract speech features, each highlighting a different representation of the speech signal. Most of these features are static speech features, because they are extracted from the spectrum, where human speech production controls the spectrum of the signal and the ear acts as a spectrum analyzer [4]. For example, some of the static features are Intensity [14], Linear Predictive Coding (LPC) [15], Perceptual Linear Predictive coefficients (PLP) [16], Mel-Frequency Cepstral Coefficients (MFCCs) [17], Linear Prediction Cepstral Coefficients (LPCC), Wavelet-Based Features [18] and Non-Negative Matrix Factorization features [19]. Among these, the most commonly used feature extraction method in an ASR system is MFCC. This section gives an overview of the calculation of MFCCs and analyzes some of their limitations in the speech feature domain.

2.3.1 An overview about MFCCs feature

This feature extraction method was first mentioned by Bridle and Brown in 1974, further developed by Mermelstein in 1976, and has been state of the art ever since [1]. The MFCC calculation steps are shown in Figure 2-6. To extract a feature vector containing all information about the linguistic message, MFCC copies some parts of human speech production and speech perception: it simulates the logarithmic perception of loudness and pitch of the human auditory system and tries to eliminate speaker-dependent characteristics by excluding the fundamental frequency and its harmonics.

Generally, the MFCC feature vector contains 13 dimensions. To calculate the first 12 coefficients, an FFT spectrum is obtained for each speech frame, the logarithm is then taken of the spectral amplitude (converting to decibels and discarding the spectral phase), a set of triangular filters spaced according to the perceptual mel scale weights this result, and finally an inverse FFT is applied. The 13th coefficient is the energy of the signal in each frame. By this calculation method, the MFCC feature vector describes only the power spectral envelope of a single frame. Because of this spectral representation, the 13 MFCCs are called static spectral coefficients. Therefore, to represent the dynamic nature of speech, the change of the feature vector over time is included as part of the feature vector: delta and delta-delta MFCCs represent velocity and acceleration parameters, and they are called dynamic spectral coefficients [20], [21], [22].


Figure 2-6: Flow chart for MFCC computation [21]
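The pipeline just described (13 static coefficients plus delta and delta-delta dynamic coefficients) can be sketched with librosa as below; this is only an illustration, since the thesis itself extracts MFCCs with Kaldi, and the synthetic tone, window and hop sizes are assumptions.

```python
import numpy as np
import librosa

# one second of a synthetic tone stands in for a speech recording
sr = 16000
y = 0.1 * np.sin(2 * np.pi * 440 * np.arange(sr) / sr).astype(np.float32)

# 13 static coefficients per frame (25 ms window, 10 ms hop)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=400, hop_length=160)

# delta ("velocity") and delta-delta ("acceleration") dynamic coefficients
delta = librosa.feature.delta(mfcc)
delta2 = librosa.feature.delta(mfcc, order=2)

features = np.vstack([mfcc, delta, delta2])   # 39 coefficients per frame
print(features.shape)
```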

2.3.2 Limitation of MFCCs

A major flaw of MFCCs lies in their final calculation step, the inverse FFT; taking low-order cosine weightings of the log spectrum is motivated entirely on mathematical grounds unrelated to speech communication [1]. Only the first two cepstral coefficients, C0 and C1, have a meaningful interpretation. The first MFCC (C0) is simply a version of energy (i.e., a weighting with a zero-frequency cosine), and C1 has a reasonable interpretation as indicating the global energy balance between low and high frequencies (the low range positively weighted by the first half of the single cosine period, and vice versa for the second half). The other cepstral coefficients have no clear interpretation other than that they contain the finer detail of the spectrum needed to discriminate the sounds. Because of this lack of interpretation, the reaction of MFCC features to accents or noise is unknown. Consequently, the feature vector distributions for each speaker must be merged, which yields greater variances and could reduce the separability of the classes [3].

Lastly, cepstral coefficients apply equal weight to high and low amplitudes of the log spectrum, even though it is known that high-energy amplitudes dominate the perception of speech. This equal weighting reduces the robustness of cepstral coefficients, because noise fills the valleys between formants and harmonics and deteriorates the performance of MFCCs [1].


2.4 State of the art of modeling dynamic acoustic speech feature

Speech science shows that natural speech is not a simple succession of steady-state segments, but rather a dynamic process, because speech is an action, a sound, a signal continuously changing in time. Yet, since phonetics and speech science are offspring of classical phonology, speech has been viewed as a sequence of discrete events: positions of the articulatory apparatus, waveform segments, and phonemes [12]. There has therefore been some research incorporating the dynamic aspects of speech into real applications. However, these dynamic concepts remain chiefly seen as ways to supplement static parameters: dynamic concepts are seen as deriving from static ones, and the function devoted to them is essentially that of reinforcing or accelerating processing methods that remain based on static parameters [13]. Therefore, research on the dynamic nature of speech is a new and promising direction in the field of speech recognition.

At MICA institute, there is a study on modeling dynamic acoustic speech features: the doctoral dissertation of Mrs. Tran Thi Anh Xuan. Her research focuses on a new acoustic gesture modeling of vowel-vowel transitions and applies it to a Vietnamese speech recognition system. Following her study, Spectral Subband Centroid Features (SSCFs) [24] can replace formants and act as "pseudo-formants" even during consonant production, because SSCF parameters were demonstrated to be similar to formant frequencies but, contrary to formant frequencies, they are continuous even during obstruent consonant production. Besides, she also proposed a way to model dynamic acoustic speech features from SSCFs, which she called SSCF angles. SSCF angles were used for the computation of vowel-vowel transitions; more details about SSCF angles will be given in Chapter 3 of this thesis. Her results showed that SSCF angles have more or less the same value for both male and female speakers on the same V1V2 transition sequence, and they are also fairly invariant to speech rate (normal and fast) for each speaker [13]. An interesting point from her dissertation is that the characterization of dynamic acoustic gestures can be a great advantage for automatic speech recognition, because it allows designing a speech recognition system that is intrinsically independent of the speakers. However, her first tests, even if they were very interesting, were not enough, and there are some issues to improve in the SSCF angles. The first possible direction for improvement concerns the calculation: for example, a better calculation of the angles, closer to the theory, in order to obtain an input vector more representative of the acoustic gesture; a better characterization of these acoustic gestures during the production of consonants; and taking better account of the speed of the transition, not only by calculating the derivative of the angle, but also by calculating a specific "speed" angle, directly measured on speed transitions [13]. Another limitation is that her research on transitional angles only considered absolute angles; no study of relative angles was presented in her thesis.

2.5 Conclusion of chapter 2

Chapter 2 studied an overview of acoustic features for an ASR system. Section 2.1 presented the goals and desired characteristics of speech features in an ASR system. To understand the speech signal, Section 2.2 showed the two basic natures of speech: its non-stationary property, and its static and dynamic characterizations.

After that, Section 2.3 studied an overview of a typical static feature, the MFCC. Although MFCC feature vectors are the most widely used in ASR systems, they have some limitations. Firstly, the MFCC feature only represents energy characteristics and does not describe the frequency characteristics of the speech signal. Secondly, MFCCs are speaker dependent, and this limitation should be addressed.

Finally, Section 2.4 reviewed the state of the art of modeling dynamic acoustic speech features. Although the current dynamic parameters are still based mainly on the spectral domain (then converted to the cepstral domain), and the dynamic concept here is chiefly the derivative and acceleration of static parameters, these parameters were the first premise for the dynamic speech feature approach in real speech applications in general, and in automatic speech recognition systems in particular.


CHAPTER 3: IMPROVEMENT ON AUTOMATIC MODELING DYNAMIC ACOUSTIC SPEECH FEATURE

3.1 Improvement on computing Spectral Sub-band Centroid Frequency (SSCF)

3.1.1 SSCF features generated from the original definition

The Spectral Subband Centroid Feature (SSCF) was first proposed by Paliwal in 1998. These features are considered "pseudo-formant" parameters because they have properties similar to formant frequencies [24]. Furthermore, SSCF parameters are continuous in the time domain, even during consonant production [13]. The algorithm for computing SSCFs is described in Figure 3-1.

Figure 3-1: The algorithm for extracting SSCFs

Following the theory introduced in [24], the calculation process contains two main steps. In the first step, the frequency band (0 to Fs/2, where Fs is the sampling frequency in Hz) is divided into a fixed number M of subbands. Each subband has a lower edge, a higher edge and a filter shape. The number of subband filters depends on the purpose of the study, and the dimension of the SSCF vector equals the number of subband filters. The second main step is to compute the centroid of each subband using the power spectrum of the speech signal. Each centroid has its frequency and its magnitude. The m-th spectral subband centroid frequency SSCFm is computed by formula 3.1:

$$\mathrm{SSCF}_m = \frac{\int_{l_m}^{h_m} f \, w_m(f) \, P^{\gamma}(f) \, df}{\int_{l_m}^{h_m} w_m(f) \, P^{\gamma}(f) \, df} \qquad (3.1)$$

where:

- l_m and h_m are the lower and higher edges of the m-th subband, and w_m(f) is the frequency response (shape) of the m-th subband filter,

- P(f) is the power spectrum,

- γ is a constant controlling the dynamic range of the power spectrum. By setting γ < 1, the dynamic range of the power spectrum can be reduced [24]. In this thesis, γ is chosen as 0.5.
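As an illustration of formula 3.1, a minimal numpy sketch that computes one subband centroid from a discretized power spectrum is given below; the random frame, FFT size and the single triangular filter are illustrative assumptions rather than the actual configuration used in the thesis.

```python
import numpy as np

def sscf_centroid(power_spectrum, freqs, weights, gamma=0.5):
    """Spectral subband centroid frequency (formula 3.1) for one subband.

    power_spectrum -- P(f) sampled on the FFT bin frequencies `freqs`
    weights        -- w_m(f), the subband filter response on the same bins
    gamma          -- exponent controlling the dynamic range of P(f)
    """
    weighted = weights * power_spectrum ** gamma
    return np.sum(freqs * weighted) / np.sum(weighted)

# toy usage on one random "frame" of a 16 kHz signal
fs, n_fft = 16000, 512
frame = np.random.randn(400)                         # stand-in for a windowed speech frame
power = np.abs(np.fft.rfft(frame, n=n_fft)) ** 2     # power spectrum P(f)
freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
# a crude triangular filter between 500 Hz and 1500 Hz, peaking at 1000 Hz
w = np.interp(freqs, [500.0, 1000.0, 1500.0], [0.0, 1.0, 0.0], left=0.0, right=0.0)
print(sscf_centroid(power, freqs, w))
```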

The primary objective of this thesis is to achieve six SSCFs (SSCF0, SSCF1, SSCF2, SSCF3, SSCF4, and SSCF5) which are not formant frequencies, but are similar to formant frequencies. This is because there are six formants (F0, F1, F2, F3, F4, and F5) used to represent the frequency content of the speech signal. For the purpose of using SSCFs instead of formants, acting as "pseudo-formants", the number of subband filters is six (M = 6). The filter bank is designed by dividing the mel scale into six equal-length subbands, converting the edges into the frequency domain, and employing a triangular shape for each subband filter, so that the subbands overlap. The shape of the six overlapping triangular filters is shown in Figure 3-2. The SSCF parameter extraction for two successive frames of the speech signal is described in Figure 3-3.

Figure 3-2: The shape of six-triangle overlapped subband filters for computing SSCF

Figure 3-3: SSCF parameter extraction from a speech signal, frame by frame [13]
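Below is a sketch of one plausible realization of the overlapping triangular filter bank described above; following Section 3.1.3, six triangles correspond to seven equal distances on the mel scale, so each filter here spans two adjacent intervals. The sampling rate, FFT size and this particular edge placement are assumptions, and the thesis' exact design may differ.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def triangular_subband_filters(n_filters=6, fs=16000, n_fft=512):
    """Build n_filters overlapping triangular filters on the FFT bin frequencies.

    The n_filters + 2 boundary points are equally spaced on the mel scale;
    filter m rises from point m to point m+1 and falls back to zero at point m+2,
    so adjacent filters overlap as in Figure 3-2.
    """
    mel_points = np.linspace(0.0, hz_to_mel(fs / 2.0), n_filters + 2)
    hz_points = mel_to_hz(mel_points)
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    bank = np.zeros((n_filters, freqs.size))
    for m in range(n_filters):
        lo, center, hi = hz_points[m], hz_points[m + 1], hz_points[m + 2]
        bank[m] = np.interp(freqs, [lo, center, hi], [0.0, 1.0, 0.0], left=0.0, right=0.0)
    return freqs, bank

freqs, bank = triangular_subband_filters(n_filters=6)
print(bank.shape)   # (6, 257): one filter per SSCF dimension
```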

After considering the six-triangle subband filters following the original definition, this section checks the accuracy of the generated SSCFs. A small database was recorded from four French native speakers (2 females, F1 and F2, and 2 males, M1 and M2) aged from 25 to 30 years old. They spoke Vowel-Vowel (V1V2) sequences four times at a normal rate. To verify the correctness of the SSCFs, this section only uses the /ai/ transition signals from this French database. For each speaker, one of the four repetitions is used to compare formants and SSCFs in Figure 3-4. In theory, when a person speaks from vowel /a/ to vowel /i/, there is a clear change in the F1 and F2 values: F1 decreases and F2 increases [25], [26]. So, SSCF1 and SSCF2 are also expected to behave similarly to F1 and F2, respectively. Ignoring the SSCF0 and SSCF5 parameters, this analysis only focuses on SSCF1, SSCF2, SSCF3 and SSCF4. In the representation of Figure 3-4, formants and SSCFs are described with the same color.

Figure 3-4: Comparison results between formants and SSCFs in /ai/ when applying six subband filters: a) F1, b) F2, c) M1 and d) M2


It is clear that for male speakers, the SSCF1-SSCF2 shape is similar to the F1-F2 shape. However, for female speakers, there is a significant difference between SSCF2 and F2, while the SSCF1 value is similar to F1. In detail, the value of SSCF2 is always smaller than F2, and there is almost no significant change in SSCF2 during the transition from /a/ to /i/. This result completely contradicts formant frequency theory. Besides, the range of F3-F4 values is always higher than that of SSCF3-SSCF4 for both males and females.

These results demonstrate that the original definition of the SSCF features is not suitable for the aim of this thesis. Therefore, a new design of the subband filters is extremely important for the purpose of using SSCFs instead of formants. Sections 3.1.2 and 3.1.3 analyze the effect of the subband filters on the SSCF features and describe a newly proposed way of computing SSCFs to make the results better and closer to the formants.

3.1.2 Influence of subband filters on the SSCF features

Following the algorithm in Figure 3-1, it can be seen that the cause of this wrong result lies in the definition of the subband filters: the subband filters directly affect the resulting SSCF values. To find a solution for computing the SSCF features, this section changes the number of subband filters while still retaining the property of equal-length subbands on the mel scale. The new number of subband filters is five. Figure 3-5 shows the five-triangle overlapped subband filters for computing SSCFs.

Figure 3-5: The shape of five-triangle overlapped subband filters for computing SSCF

The new SSCF results are evaluated on the same /ai/ database that was used in the experiment with the six overlapping triangular subband filters. Figure 3-6 shows a comparison between SSCFs and formants when applying the five overlapping triangular subband filters.


Figure 3-6: Comparison results between formants and SSCFs in /ai/ when applying five overlapping subband filters: a) F1, b) F2, c) M1 and d) M2


The results in Figure 3-6 are better than the results in Figure 3-4 for the /ai/ transition. It is clear that for both male and female speakers, the SSCF1-SSCF2 shape is similar to the F1-F2 shape; in particular, these results clearly show the transition from vowel /a/ to vowel /i/. Nevertheless, for all speakers, there is a significant difference between SSCF1 and F1. In detail, the values of SSCF1 and SSCF4 are always much higher than F1 and F4, respectively.

For a more objective comparison of the effect of the number of subband filters on the SSCF results, this section implemented an extended experiment. This extended experiment kept the property of equal-length subbands on the mel scale as a common condition for both subband filter types. The evaluation method is described in Figure 3-7.

Figure 3-7: The method for evaluating the effect of the number of subband filters on SSCF results

This section tested four different /aV/ transitions (where V is a vowel), for two females and two males: /ai/, /ae/, /ao/ and /au/. The comparison results are shown in Figure 3-8, Figure 3-9, Figure 3-10 and Figure 3-11, respectively.

- /ai/ transition

Figure 3-8: Comparison when using 5 or 6 Triangular Subband Filters in /ai/ transition: a) F1, b) F2, c) M1 and d) M2

- /ae/ transition

Figure 3-9: Comparison when using 5 or 6 Triangular Subband Filters in /ae/ transition: a) F1, b) F2, c) M1 and d) M2

- /ao/ transition

Figure 3-10: Comparison when using 5 or 6 Triangular Subband Filters in /ao/ transition: a) F1, b) F2, c) M1 and d) M2

- /au/ transition

Figure 3-11: Comparison when using 5 or 6 Triangular Subband Filters in /au/ transition: a) the first female (F1), b) the second female (F2), c) the first male (M1), d) the second male (M2)

The results in Figure 3-8, Figure 3-9, Figure 3-10 and Figure 3-11 support two main evaluations: the first concerns the shape of the SSCF1-SSCF2 trajectory, and the second concerns the range of the SSCF1-SSCF2 values. These evaluations are based on the publication [25] about /aV/ characteristics in the F1-F2 plane.

Figure 3-12: [aV] trajectories for native French speakers at normal rate: a) F1-F2 plane from publication [25]; SSCF1-SSCF2 plane from measurement results when b) using 6 triangular filters, c) using 5 triangular filters


Firstly, when using the six-triangle subband filters, the SSCF1-SSCF2 shape is more similar to the F1-F2 shape for males than for females. In particular, the SSCF1-SSCF2 shape is very different from the F1-F2 shape in both the /ai/ and /ae/ transitions for females; this is clearly visible in Figure 3-8 and Figure 3-9. In contrast, when using the five-triangle subband filters, the SSCF1-SSCF2 shape is similar to the F1-F2 shape for both females and males.

Secondly, consider the /aV/ formant trajectories in Figure 3-12a, focusing on /ai/, /ae/, /ao/ and /au/. They can be divided into two kinds of trajectories: the first type is /ai/ and /ae/, the second type is /ao/ and /au/. Following the publication [25], when a speaker goes from vowel /a/ to vowel /i/ or /e/, F1 decreases (from about 800 Hz to about 250 Hz in /i/ and about 380 Hz in /e/), while F2 increases from about 1400 Hz to more than 2200 Hz. In the second type, F1 and F2 both go down. Comparing these with the values of SSCF1 and SSCF2, there is a significant difference between the two situations using triangular filters. When 6 triangular filters are applied, it is easy to see that SSCF2 is always lower than 2000 Hz (detailed in Figure 3-12b). This explains why the SSCF1-SSCF2 shape with six-triangle subband filters is completely inconsistent with the F1-F2 form for females: the formant frequencies of females are higher than those of males for the same vowel [25], [26], [27]. For the second type (/ao/ and /au/), the SSCF1-SSCF2 shape is quite similar to the F1-F2 shape; however, the distinction between the four transitions, as presented in Figure 3-12b, is not clear. When the five-triangle subband filters are applied, the ai-ae and ao-au trajectories are much clearer. In detail, Figure 3-12c shows an SSCF1-SSCF2 plane in which the ai-ae transitions are close to the theory. However, the values of SSCF1 and SSCF4 are still mostly much higher than F1 and F4, respectively.

From the above analysis, it can be verified that the subband filters directly affect the SSCF values. In addition, these analyses are the basis for designing the newly proposed SSCF computation method in Section 3.1.3.

3.1.3 New proposal design of subband filters

The influence of the number of triangular subband filters on the SSCF values was presented in Section 3.1.2. So what is the cause of these results? It is the property of equal length on the mel scale. In detail, on the mel scale (from 0 to mel(Fs/2)), if six-triangle subband filters are used, seven equal distances are needed, while six equal distances are needed if the system uses five-triangle subband filters. The transformation between the mel scale M and the Hz frequency H can be defined by Eq. 3.2 and Eq. 3.3:

$$M(H) = 2595 \log_{10}\left(1 + \frac{H}{700}\right) \qquad (3.2)$$

$$H(M) = 700\left(10^{M/2595} - 1\right) \qquad (3.3)$$
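A small numeric sketch of Eq. 3.2 and Eq. 3.3 is given below, printing the boundary points implied by "seven equal distances" for six triangular filters versus "six equal distances" for five; the 16 kHz sampling rate is an assumption.

```python
import numpy as np

def hz_to_mel(h):
    return 2595.0 * np.log10(1.0 + h / 700.0)       # Eq. 3.2

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)     # Eq. 3.3

fs = 16000
for n_filters in (6, 5):
    # n_filters triangles need n_filters + 1 equal mel distances,
    # i.e. n_filters + 2 boundary points between 0 and mel(fs / 2)
    mel_points = np.linspace(0.0, hz_to_mel(fs / 2.0), n_filters + 2)
    hz_points = np.round(mel_to_hz(mel_points)).astype(int)
    print(f"{n_filters} filters -> boundary points (Hz): {hz_points}")
```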
