THESIS FOR THE DEGREE OF MASTER OF SCIENCE
Acknowledgments
Firstly, I would like to express my gratitude to my supervisor, Dr Eric Castelli, whose expertise, understanding, patience, and constructively critical eye added considerably to my graduate experience.
Special thanks go to Dr Nguyen Trong Giang and Dr Pham Thi Ngoc Yen for providing me with the best possible working conditions during my time at the International Research Center MICA.
I would like to thank the PhD students Nguyen Viet Tung, Tran Do Dat, Vu Minh Quang and Le Xuan Hung, who helped me a lot in finishing the thesis.
I would also like to thank my family, especially my parents, for the support they have provided throughout my life; without their care and encouragement I would not have finished this thesis.
Finally, thanks go to all of my colleagues who helped me while I worked on this thesis.
Table of contents
Acknowledgments
List of Figures
List of Tables
Chapter 1 INTRODUCTION
Chapter 2 SPEECH PRODUCTION PROCESS
2.1 Introduction
2.2 Sound
2.3 Speech production
2.3.1 Articulators
2.3.2 The voicing mechanism
Chapter 3 AN OVERVIEW OF PROSODY
3.1 The concepts of prosody and intonation
3.2 Levels of representation of prosodic phenomena
3.3 The functions of prosody
3.4 Applications of intonation
Chapter 4 PROSODY IN VIETNAMESE
4.1 General characteristics of Vietnamese language
4.1.1 Phoneme system
4.1.2 Syllable structure
4.1.3 The tonal system
4.1.4 Tones in context
4.1.5 Modality, attitude and morphosyntactic structures
4.2 Some studies on Vietnamese prosody
Chapter 5 FUNDAMENTAL FREQUENCY DETECTION
5.1 Introduction
5.2 Some pitch detection algorithms
5.2.1 The autocorrelation method
5.2.2 The average magnitude difference function method
5.2.3 The simple inverse filtering tracking method
5.2.4 The cepstrum-based method
5.3 The Praat pitch tracker
5.3.1 Introduction
5.3.2 Windowing and sampling problems
5.3.3 Evaluation
Chapter 6 EXPERIMENTAL INTONATION ANALYSIS
6.1 Objective
6.2 Speech corpus
6.3 Hypotheses
6.4 Experiments
6.4.1 First experiment
6.4.2 Second experiment
6.4.3 Third experiment
Chapter 7 CONCLUSION AND PERSPECTIVES
References
Appendix
A List of questions in the corpus
B List of statements in the corpus
List of Figures
Figure 2.1 The underlying determinants of speech generation and understanding. The gray boxes indicate the corresponding computer system components for spoken language processing [1]
Figure 2.2 Application of sound energy causes alternating compression/rarefaction of air molecules, described by a sine wave [1]
Figure 2.3 A schematic diagram of the human speech production apparatus
Figure 2.4 Schematic representation of the complete physiological mechanism of speech production [2]
Figure 2.5 A section of waveform of the utterance “sa”. The unvoiced sound “s” in the first part and the voiced sound “a” in the second part
Figure 2.6 Vocal fold cycling at the larynx: (a) closed with sub-glottal pressure buildup; (b) trans-glottal pressure differential causing folds to blow apart; (c) pressure equalization and tissue elasticity forcing temporary reclosure of vocal folds, ready to begin next cycle [1]
Figure 2.7 Glottal airflow and the resulting sound pressure at the mouth [2]
Figure 4.1 Example of the contours of six tones (female subject PNY), as described in [7]
Figure 4.2 F0 variations of 2 typical pairs of sentences in [9]
Figure 5.1 Autocorrelation function for (a) and (b) voiced speech, and (c) unvoiced speech [10]
Figure 5.2 Example of waveforms and correlation function: (a) no clipping, (b) center clipped [10]
Figure 5.3 AMDF function for same speech segments as in Figure 5.1 [10]
Figure 5.4 Block diagram of the SIFT algorithm [10]
Figure 5.5 Cepstrum of an example segment of: (a) voiced speech, (b) unvoiced speech
Figure 5.6 Windowing a signal and estimating the ACF of a signal segment from the ACF of its windowed version [15]
Figure 5.7 Some F0 points are detected in the unvoiced consonant “kh” of the word “không” (female subject HT)
Figure 5.8 Pitch halving errors in the middle of the word “trà” (female subject LH)
Figure 5.9 Some F0 points are missed in the voiced consonant “b” of the word “biết” (female subject LH)
Figure 5.10 Some F0 points are missed in the middle of the word “rõ” (female subject VL)
Figure 5.11 Some F0 points are missed at the end of the word “vậy” (male subject VN)
Figure 6.1 Speech waveform (in the background) and F0 contour (blue dotted line) of the utterance “Bây giờ anh ở đâu?” (male subject ND). The final syllable “đâu” is bounded by two vertical lines
Figure 6.2 F0 contour (blue dotted line) and proposed intonation contour (red [...] VN)
Figure 6.3 The intonation contour (red dotted line) of the statement “Bà ấy làm giáo viên.” (male subject VN)
Figure 6.4 The intonation contour (red dotted line) of the question “Bà có nhìn rõ không?” (male subject VN)
Figure 6.5 F0 level of all speakers for questions (Q) and statements (S)
Figure 6.6 Time waveform (top), F0 contour (middle) and the position of 5 representative points
List of Tables
Table 3.1 Links between levels of representation of prosodic phenomena [3]
Table 3.2 Information conveyed by prosody, ‘*’ marking feature discussed in this study [4]
Table 4.1 Vietnamese vowels
Table 4.2 Vietnamese consonants
Table 4.3 Arrangement of Vietnamese consonants
Table 4.4 The phonological hierarchy of Vietnamese syllables with total numbers of each phonetic unit [6]
Table 4.5 The six Vietnamese tones
Table 5.1 Praat PDA evaluation for male speech and female speech
Table 6.1 Speakers’ information
Table 6.2 Statistics on F0 level of all speakers for questions (Q) and statements (S) including: mean, minimum (min), maximum (max) and standard deviation (std)
Table 6.3 Representative values of “ngang” tone in final position of questions and statements for all speakers
Chapter 1 INTRODUCTION
Since intonation forms such a central part of human speech communication, conveying not only diverse linguistic information but also information about the speaker and the speaker’s mood and attitude, it certainly ought to be useful in applications such as speech recognition and speech synthesis. In the field of speech recognition, the more the task develops from the recognition of single words in a limited vocabulary towards the understanding of complex utterances, the more suprasegmental features like intonation have to be taken into account. These are important cues for the segmentation and classification (question vs. statement, for instance) of utterances. In speech synthesis, modeling intonational features is indispensable for increasing the intelligibility and naturalness of synthetic speech. This is why I chose to study the characteristics of Vietnamese intonation in questions.
The thesis is organized as follows. Chapter 2 gives a brief review of the human speech production system and introduces some related fundamental concepts. An overview of prosody, which includes intonation, is presented in Chapter 3. Chapter 4 describes the general characteristics of the Vietnamese language and some studies on Vietnamese prosody. Fundamental frequency, the acoustic correlate of intonation, and the problem of its estimation are discussed in Chapter 5. Chapter 6 presents the experiments carried out in this work and the results obtained. Finally, the conclusion and the perspectives of the study are given in Chapter 7.
Chapter 2 SPEECH PRODUCTION PROCESS
2.1 Introduction
Spoken language is used to communicate information from a speaker to a listener. Speech production and perception are both important components of the speech chain. Speech begins with a thought and an intent to communicate in the brain, which activates muscular movements to produce speech sounds. The listener receives the speech in the auditory system, which processes it for conversion to neurological signals the brain can understand. The speaker continuously monitors and controls the vocal organs by receiving his or her own speech as feedback.
Considering the universal components of speech communication as shown in Figure 2.1, the fabric of spoken interaction is woven from many distinct elements. The speech production process starts with the semantic message in a person’s mind to be transmitted to the listener via speech. The computer counterpart to the process of message formulation is the application semantics that creates the concept to be expressed. After the message is created, the next step is to convert it into a sequence of words. Each word consists of a sequence of phonemes that corresponds to the pronunciation of the word. Each sentence also contains a prosodic pattern that denotes the duration of each phoneme, the intonation of the sentence, and the loudness of the sounds. Once the language system finishes the mapping, the talker executes a series of neuromuscular signals. The neuromuscular commands perform articulatory mapping to control the vocal cords, lips, jaw, tongue, and velum, thereby producing the sound sequence as the final output.
The speech understanding process works in reverse order. First the signal is passed to the cochlea in the inner ear, which performs frequency analysis as a filter bank. A neural transduction process follows and converts the spectral signal into activity signals on the auditory nerve, corresponding roughly to a feature extraction component. Currently, it is unclear how neural activity is mapped into the language system and how message comprehension is achieved in the brain.
Figure 2.1 The underlying determinants of speech generation and understanding. The gray boxes indicate the corresponding computer system components for spoken language processing [1]
2.2 Sound
Figure 2.2 Application of sound energy causes alternating compression/rarefaction of air molecules, described by a sine wave [1]
The use of the sine graph in Figure 2.2 is only a notational convenience for charting local pressure variations over time, since sound does not form a transverse wave, and the air particles are just oscillating in place along the line of application of energy. The amount of work done to generate the energy that sets the air molecules in motion is reflected in the amount of displacement of the molecules from their resting position. This degree of displacement is measured as the amplitude of a sound, as shown in Figure 2.2.
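As a compact restatement of the sine-wave description above, the charted quantity for a pure tone can be written as a function of time. The symbols p, A, f and \varphi are notation assumed here for illustration only; they do not appear in the source.

```latex
% Local pressure variation at a fixed point for a pure tone:
%   A        amplitude (maximum deviation from the resting state)
%   f        frequency in Hz (compression/rarefaction cycles per second)
%   \varphi  initial phase
p(t) = A \sin(2\pi f t + \varphi)
```

Under this notation, A is the amplitude referred to in Figure 2.2, and 1/f is the duration of one full compression/rarefaction cycle.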
2.3 Speech production
2.3.1 Articulators
Speech is produced by air-pressure waves emanating from the mouth and the nostrils of a speaker. In most of the world’s languages, the inventory of phonemes can be split into two basic classes:
• [...] throat or obstructions in the mouth (tongue, teeth, lips) as we speak
Figure 2.3 A schematic diagram of the human speech production apparatus
The sounds can be further partitioned into subgroups based on certain articulatory properties. These properties derive from the anatomy of a handful of important articulators and the places where they touch the boundaries of the human vocal tract. Additionally, a large number of muscles contribute to articulator positioning and motion. A schematic view of only the major articulators is diagrammed in Figure 2.3. The gross components of the speech production apparatus are the lungs, trachea, larynx (organ of voice production), pharyngeal cavity (throat), and oral and nasal cavities. The pharyngeal and oral cavities are typically referred to as the vocal tract, and the nasal cavity as the nasal tract. As illustrated in Figure 2.3, the human speech production apparatus consists of:
• [...] oscillate against one another during a speech sound, the sound is said to be voiced. When the cords are too slack or tense to vibrate periodically, the sound is said to be unvoiced. The place where the vocal cords come together is called the glottis.
• [...] passage of air through the nasal cavity. Sounds produced with [...]
• [...] mouth, which, when the tongue is placed against it, enables consonant articulation.
• [...] vowels, placed close to or on the palate or other hard surfaces for consonant articulation.
• [...] certain consonants.
• [...] closed completely to stop the oral air flow in certain consonants (p, b, m).
Figure 2.4 Schematic representation of the complete physiological mechanism of speech production [2]
A simplified representation of the complete physiological mechanism for creating speech is shown in Figure 2.4. Air enters the lungs via the normal breathing mechanism. As air is expelled from the lungs through the trachea (or windpipe), the tensed vocal cords within the larynx are caused to vibrate (in the mode of a relaxation oscillator) by the air flow. The air flow is chopped into quasi-periodic pulses which are then modulated in frequency in passing through the throat (pharyngeal cavity), the mouth cavity, and possibly the nasal cavity. Depending on the positions of the various articulators, different sounds are produced.
2.3.2 The voicing mechanism
The most fundamental distinction between sound types in speech is the voiced/unvoiced distinction. Voiced sounds, including vowels, have in their time and frequency structure a roughly regular pattern that voiceless sounds, such as consonants like s, lack. Voiced sounds typically have more energy, as shown in Figure 2.5, which displays part of the waveform of the utterance “sa”, consisting of two phonemes: the unvoiced consonant /s/ and the vowel /a/.
Figure 2.5 A section of waveform of the utterance “sa”. The unvoiced sound “s” in the first part and the voiced sound “a” in the second part
What in the speech production mechanism creates this fundamental distinction? When the vocal folds vibrate during phoneme articulation, the phoneme is considered voiced; otherwise it is unvoiced. Voiced sounds are produced by forcing air through the glottis with the tension of the vocal cords adjusted so that they vibrate in a relaxation oscillation, thereby producing quasi-periodic pulses of air which excite the vocal tract. The resulting speech waveform is therefore quasi-periodic. Unvoiced sounds are generated by forming a constriction at some point in the vocal tract (usually toward the mouth end) and forcing air through the constriction at a high enough velocity to produce turbulence. This creates a broad-spectrum noise source to excite the vocal tract, so the resulting speech waveform is aperiodic or random in nature. The vocal folds vibrate at slower or faster rates, from as low as 60 cycles per second (Hz) for a large man to as high as 300 Hz or higher for a small woman or child. The rate of cycling (opening and closing) of the vocal folds in the larynx during phonation of voiced sounds is called the fundamental frequency, or F0. This is because it sets the periodic baseline for all higher-frequency harmonics contributed by the pharyngeal and oral [...] well with the subjective experience of pitch (the rising and falling of voice tones). It is therefore common practice to use the terms F0 and pitch interchangeably, and in the remainder of the thesis I will do the same.
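The contrast between quasi-periodic (voiced) and aperiodic (unvoiced) waveforms described above can be illustrated with a minimal sketch. It is not the method used in this thesis: the signals are synthetic, and the sampling rate, the 60-400 Hz search range, the 0.5 voicing threshold and all function names are assumptions chosen for illustration. The sketch looks for the lag at which a frame best correlates with a shifted copy of itself and reads F0 off that lag.

```python
import math
import random

SAMPLE_RATE = 16000  # Hz; assumed value for this sketch


def normalized_autocorrelation(frame, lag):
    """Correlation (between -1 and 1) of the frame with itself shifted by `lag` samples."""
    n = len(frame) - lag
    num = sum(frame[i] * frame[i + lag] for i in range(n))
    e1 = sum(frame[i] ** 2 for i in range(n))
    e2 = sum(frame[i + lag] ** 2 for i in range(n))
    denom = math.sqrt(e1 * e2)
    return num / denom if denom > 0 else 0.0


def estimate_f0(frame, f_min=60.0, f_max=400.0, voicing_threshold=0.5):
    """Return (F0 in Hz or None if judged unvoiced, peak correlation)."""
    min_lag = int(SAMPLE_RATE / f_max)
    max_lag = int(SAMPLE_RATE / f_min)
    best_lag, best_r = None, -1.0
    for lag in range(min_lag, max_lag + 1):
        r = normalized_autocorrelation(frame, lag)
        if r > best_r:
            best_lag, best_r = lag, r
    if best_r < voicing_threshold:
        return None, best_r  # no clear periodicity: treat as unvoiced
    return SAMPLE_RATE / best_lag, best_r


def synth_voiced(f0_hz, n_samples):
    """Three harmonics of f0_hz: a crude stand-in for a voiced frame."""
    return [
        sum(math.sin(2 * math.pi * k * f0_hz * i / SAMPLE_RATE) / k for k in (1, 2, 3))
        for i in range(n_samples)
    ]


def synth_unvoiced(n_samples, seed=0):
    """White Gaussian noise: a crude stand-in for an unvoiced frame."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n_samples)]


voiced_f0, voiced_peak = estimate_f0(synth_voiced(120.0, 640))
noise_f0, noise_peak = estimate_f0(synth_unvoiced(640))
print("voiced frame:", voiced_f0, voiced_peak)
print("noise frame:", noise_f0, noise_peak)
```

On real speech the picture is less clean: frames at voicing boundaries and octave (halving or doubling) errors need extra care, which is why dedicated pitch detection algorithms exist.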
Figure 2.6 Vocal fold cycling at the larynx: (a) closed with sub-glottal pressure buildup; (b) trans-glottal pressure differential causing folds to blow apart; (c) pressure equalization and tissue elasticity forcing temporary reclosure of vocal folds, ready to begin next cycle [1]
The glottal cycle is illustrated in Figure 2.6. At stage (a), the vocal folds are closed and the air stream from the lungs is indicated by the arrow. At some point, the air pressure on the underside of the barrier formed by the vocal folds increases until it overcomes the resistance of the vocal fold closure, and the higher air pressure below blows them apart (b). However, the tissues and muscles of the larynx and the vocal folds have a natural elasticity which tends to make them fall back into place rapidly once air pressure is temporarily equalized (c). The successive airbursts resulting from this process are the source of energy for all voiced sounds. The time for a single open-close cycle depends on the stiffness and size of the vocal folds and the amount of sub-glottal air pressure. These factors can be controlled by a speaker to raise and lower the perceived frequency or pitch of a voiced sound.
The glottal air flow (volume velocity waveform) and the resulting sound pressure at the mouth for a typical vowel sound are shown in Figure 2.7. The glottal waveform shows a gradual build-up to a quasi-periodic pulse train of air, taking about 15 ms to reach steady state. This build-up is also reflected in the acoustic waveform shown at the bottom of the figure.
Figure 2.7 Glottal airflow and the resulting sound pressure at the mouth [2]
Chapter 3 AN OVERVIEW OF PROSODY
3.1 The concepts of prosody and intonation
The term prosody (ngôn điệu) refers to certain properties of the speech signal such as audible changes in pitch, loudness, and syllable length [3]. For some authors the set of prosodic features also includes other aspects related to speech timing, such as rhythm and speech rate.
Because prosodic events appear to be time-aligned with syllables or groups of syllables rather than with segments (phonemes), they are also referred to as suprasegmental phenomena.
[...] prosody. Others restrict the term intonation to the tonal (melodic) aspects of prosody [3], and this is the usage adopted here: in this thesis, intonation refers to pitch variation in speech production and is part of prosody.
3.2 Levels of representation of prosodic phenomena
As for other properties of the speech signal, prosodic events can be studied at various levels of representation (see Table 3.1):
• First, the acoustic level: [...] (fundamental frequency, amplitude, and duration) can be measured directly, using specialized hardware or algorithms (such as pitch determination algorithms).
• Second, the perceptual level represents the prosodic events as heard by the listener. As for spectral properties of speech sounds, acoustic characteristics that can be measured are not always perceptible. The perceptual representation is accessible to the individual listener, but this mental representation can hardly be measured. Alternatively, it can be computed with a fair amount of precision on the basis of our knowledge about psychoacoustics.
• Third, the linguistic level: [...] as a sequence of abstract units (signs, symbols), some of which have a communicative function in speech, while others may just fulfill syntactic requirements. The linguistic structure of prosody is not some hidden code that simply can be revealed using some standard procedure.
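Two of the acoustic-level parameters named in the list above, duration and amplitude, can be read directly off a sampled signal; F0 needs a pitch determination algorithm and is left out here. The following is a minimal sketch under stated assumptions: the function names, the RMS definition of amplitude and the reference value of 1.0 for the decibel level are illustrative choices, not taken from the source.

```python
import math


def duration_seconds(samples, sample_rate):
    """Duration of a segment, from its sample count and sampling rate."""
    return len(samples) / sample_rate


def rms_amplitude(samples):
    """Root-mean-square amplitude of a segment (0.0 for an empty segment)."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def level_db(samples, reference=1.0):
    """RMS level in decibels relative to `reference` (assumed full scale = 1.0)."""
    rms = rms_amplitude(samples)
    return 20.0 * math.log10(rms / reference) if rms > 0 else float("-inf")


# Hypothetical test segment: 0.25 s of a 100 Hz sine with peak amplitude 0.5.
rate = 8000
segment = [0.5 * math.sin(2 * math.pi * 100 * i / rate) for i in range(2000)]
print(duration_seconds(segment, rate), rms_amplitude(segment), level_db(segment))
```

These values sit at the acoustic level only; whether a measured difference is audible or linguistically relevant belongs to the other two levels.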
Table 3.1 Links between levels of representation of prosodic phenomena [3]
Acoustic Perceptual Linguistic
Fundamental frequency (F0)
As one moves away from the acoustic level towards the perceptual and/or linguistic levels, the measurement of a given prosodic property will progressively involve segmentation (for example, into syllables), context (such as relative prominence), and structural information. The linguistic interpretation of a syllabic tone, for example, often depends on whether the related syllable is stressed or not, which requires a prior analysis of the segmental layer.
3.3 The functions of prosody
The functions of prosody can be divided into those which modify meaning and those which do not (see Table 3.2). The former can also be seen as the part of the information that is consciously and intentionally provided by the speaker, the message, whereas the latter involuntarily accompanies it.
Table 3.2 Information conveyed by prosody, ‘*’ marking feature discussed in this study [4]
speaker's intention, attitude
age
sex
speaker's background (native language, dialect)
emotional condition
Prosodic features have specific functions in speech communication. One of the most uncontroversial functions of intonation is that of conveying different illocutionary aspects, or sentence modes. Thus it is commonly maintained that a distinction between declarative and interrogative modes is one of the most universal characteristics of intonation systems. My work is centered on the contribution of prosody to the expression of the interrogative mode.
One of the most apparent effects of prosody is that of focus. For instance, certain pitch events make a syllable stand out within the utterance, and indirectly the word or syntactic group it belongs to will be highlighted as an important or new component in the meaning of that utterance. The presence of a focus marking may have various effects, such as contrast, depending on the place where it occurs or the semantic context of the utterance.
Prosodic features create a segmentation of the speech chain into groups of syllables, or, put the other way round, they give rise to the grouping of syllables and words into larger chunks.
All these aspects of intonation can be grouped under the heading of linguistic aspects of intonation. They are part of the structure of language (and specific to any given language) in the same way as morphology and syntax are. The linguistic features concern the way a message is formally coded and organized into intonational units of a certain language. They correspond to the surface structure of the message on a still rather abstract level. The actual meaning of the message often cannot be decoded without interpreting the underlying paralinguistic information. Paralinguistic information is defined as the information that is not inferable from the written counterpart but is deliberately added by the speaker to modify or supplement the linguistic information. A written sentence can be uttered in various ways to express different intentions and attitudes which are under the conscious control of the speaker. The question “Are you tired?”, for instance, is simply a request for information on someone’s psychological and physiological condition. If it is asked with a concerned undertone, the message may be: “Come on, you’ve been working so hard, you have to get yourself some sleep!” With an ironic undertone, it may mean: “You lazy guy, you’ve been sleeping all day and still you’re tired!”
There is, however, another range of phenomena that are also expressed by prosodic means (such as pitch) but do not modify the meaning of a message. They can convey information about the age, gender, and emotional or physical state of the speaker. These factors are not directly related to the linguistic and paralinguistic contents of the utterances and cannot generally be controlled by the speaker. Angry people, for instance, usually have faster pitch changes, a larger pitch range, and a larger dynamic amplitude range, whereas depressed people typically show the opposite trend. But while the pitch range may be affected by such emotional factors, the basic functional pitch shapes and configurations remain unaffected. The emotional state does not alter the linguistic code; it merely affects its realization. This is why these aspects are called non-linguistic aspects.
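Quantities such as "pitch range" and "speed of pitch change" mentioned above can be made concrete on an F0 contour. The sketch below is only one possible operationalization and is an assumption of this example, not a definition from the source: the range is expressed in semitones between the lowest and highest voiced F0 value, and the speed as the mean absolute change in semitones per second between consecutive voiced frames. Unvoiced frames are represented by None, and the function names are hypothetical.

```python
import math


def pitch_range_semitones(f0_values):
    """Interval in semitones between the lowest and highest voiced F0 value."""
    voiced = [f for f in f0_values if f]  # drops None and 0
    if len(voiced) < 2:
        return 0.0
    return 12.0 * math.log2(max(voiced) / min(voiced))


def mean_abs_slope(f0_values, frame_step_s):
    """Mean absolute F0 change, in semitones per second, over voiced frame pairs."""
    total, count = 0.0, 0
    for prev, cur in zip(f0_values, f0_values[1:]):
        if prev and cur:  # skip pairs that touch an unvoiced frame
            total += abs(12.0 * math.log2(cur / prev)) / frame_step_s
            count += 1
    return total / count if count else 0.0


# Hypothetical contour sampled every 10 ms; None marks unvoiced frames.
contour = [110.0, 118.0, 131.0, None, 140.0, 125.0, 110.0]
print(pitch_range_semitones(contour), mean_abs_slope(contour, 0.01))
```

A logarithmic (semitone) scale is used so that speakers with different average F0, such as male and female voices, can be compared on similar terms.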
An understanding of the information conveyed by intonation is important for intonation study. Each type of information has its effect on tonal variations, i.e., on intonation. These effects need to be taken into account in intonation analysis.
3.4 Applications of intonation
Since intonation forms such a central part of human speech communication, conveying not only diverse linguistic information but also information about the speaker and the speaker’s mood and attitude, it certainly ought to be useful in many applications. Apart from language technology and speech synthesis, where the use of intonation is well established, diverse medical and educational applications in which intonation is less commonplace are being developed.
Speech processing:
The increasing demand for speech in man-machine communication, in areas ranging from telephony, telematics, and automated translation to aids for the handicapped, requires sophisticated technology for the analysis, recognition and synthesis of speech. In the field of speech recognition, the more the task develops from the recognition of single words in a limited vocabulary towards the understanding of complex utterances, the more suprasegmental features like intonation have to be taken into account. These are important cues for the segmentation and classification (question vs. declaration, for instance) of utterances. In this context, modeling intonation is an important task.
In speech synthesis, modeling intonational features is indispensable for increasing the intelligibility and naturalness of synthetic speech. Sophisticated [...] sentence at a given speech rate.
Automatic language identification could be important, especially in telecom applications where the spectral content of the speech can be expected to be distorted. Intonation cues are especially interesting in this case, since the varied intonational structure of languages could be exploited. In this recognition task, the intonation cues need to be combined with other types of information.
Speech Pathology:
Hearing impairments, especially if they are congenital or acquired at an early age, are accompanied by a reduced intelligibility of speech. Major factors for this are an imperfect command of phonatory effort and a lack of control of the laryngeal function, which result in a high degree of variation in the pitch patterns produced. The speech of hearing-impaired people may sound monotonous or, on the contrary, excessively emotional. The basic pitch is often kept at a level either too high or too low, and it has been observed that hearing-impaired persons have difficulties in changing their pitch within a single syllable. Teaching aids that provide pitch feedback over tactile or visual channels have been developed to overcome these problems.
Foreign Language Education:
It is widely agreed that acquiring a good command of intonational features in a foreign language is one of the most difficult tasks a student must accomplish. Yet it is crucial for the degree of intelligibility he or she will achieve. In traditional language education, however, intonation usually comes second to segmental phonetics, which itself forms only a small part of the curriculum of common language courses. This deficit has become more apparent as political and economic globalization requires better communicative skills on the part of the learner of a foreign language. In this context, individual computer-based language education will play a growing role. Software is needed which is suited to the special problems of the speaker of a language L1 who studies a target language L2. Although a number of programs exist with which the student can train his or her lexical, grammatical or orthographic skills, there are few systems which use speech input to help correct the student's pronunciation. In this context, visualization of speech can provide additional feedback where the auditory channel fails because of mother tongue interference.
It seems desirable to develop more intelligent systems which are customized to the special requirements of students with the same native language. Contrastive studies of the intonational systems of L1 and L2 can help to predict problems and select appropriate teaching materials.
Chapter 4 PROSODY IN VIETNAMESE
An understanding of the phonetic and phonological characteristics of a language plays an important role in studies on speech processing in general and on intonation analysis in particular. This chapter reviews the characteristics of the Vietnamese language and some works by other authors related to my study.
4.1 General characteristics of Vietnamese language
Vietnamese is known as a tonal language, one which uses tone to distinguish lexical meaning. Vietnamese has basically six lexical tones. Each tone could [...] mẽ, mẹ. This is not the case for non-tonal languages. In English, for example, the position of the stressed syllable within a word is lexically distinctive.
Table 4.1 Vietnamese vowels
Transcription Reading Letters Example
Vietnamese includes 22 consonants [5]:
Table 4.2 Vietnamese consonants
Transcription Reading Letter Example
[...] of consonant in syllable. Based on these features, Vietnamese consonants can [...]
[...] varieties of Vietnamese, the whole tonal paradigm can occur.
Table 4.4 The phonological hierarchy of Vietnamese syllables with total numbers of each phonetic unit [6]
TONAL SYLLABLE (6492)
  BASE SYLLABLE (2376)
    Initial (22)
    Final (155)
      Medial (1)
      Nucleus (16)
      Ending (8)
  TONE (6)
4.1.3 The tonal system
There are six syllabic tones in Vietnamese (see Table 4.5). To describe the tonal system on a physical basis, most linguists have studied tones in isolated syllables, where they are likely to be realized as closely as possible to their phonotype. In terms of distinctive features, Vietnamese tones can be described according to register, contour and glottalization (the complete or partial closure of the glottis during the articulation of another sound). These tones can be separated into two groups according to register: “ngang”, “sắc”, “ngã” are realized in a higher register while “huyền”, “nặng”, “hỏi” are realized in a lower one. Based on the glottalization feature, these six tones can be classified into two groups: “ngã” and “nặng” tones are [...]
Table 4.5 The six Vietnamese tones
Figure 4.1 Example of the contours of six tones (female subject PNY), as described in [7]
The F0 contours of the six Vietnamese tones (examples are shown in Figure 4.1) are described as follows [6]:
• Tone 1 (“ngang”): [...] syllable, it is the highest tone. The steady state of the level contour is observed consistently.
• Tone 2 (“huyền”): [...] lower than tone 1, tone 5 and tone 3. The low F0 at the onset gradually falls toward the end.
• Tone 3 (“ngã”): [...] level of tone 5, it is higher than the falling tone. The second third of the contour of this tone is characterized by an abrupt dip caused by glottalization. In most cases, the bottom of the dip occurs between the mid-point and the point two-thirds from the onset. A creaky voice is heard during this dip.
• Tone 4 (“hỏi”): [...] six tones. The low onset falls further gradually until the point two-thirds from the onset. From this point, the extremely low F0 starts to rise toward the end.
• Tone 5 - Rising tone (“sắc”): the onset is also high. Starting from the high onset, the F0 gradually rises for the first two thirds of the duration. After this point, the rise becomes more rapid.
• Tone 6 (“nặng”): [...] of the falling or curve tone but considerably lower than tone 1, tone 5 and tone 3. This tone is characterized by glottalization at the end and also by its considerably shorter duration than the other tones. The duration of this tone is approximately two thirds of that of the other tones. The main body of this tone is almost level or slightly falling.
These descriptions hold only for the Northern dialect, in particular the Hanoi dialect, which is the standard dialect of Vietnamese. They differ for the dialects of the South and the Center of Vietnam. In these regions there are only five tones instead of the six of the Hanoi dialect, because tone 3 and tone 4 are pronounced identically.
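The register and glottalization groupings given in this section can be written down as a small lookup structure. This is only an illustrative encoding: the class name, field names and helper function are hypothetical, and treating the four tones other than “ngã” and “nặng” as non-glottalized is an assumption read off the two-group classification above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tone:
    name: str
    register: str       # "high" or "low", as grouped in the text
    glottalized: bool   # True for the two tones described with glottalization


# Hanoi-dialect tones only; other dialects merge some of these.
TONES = (
    Tone("ngang", "high", False),
    Tone("sắc", "high", False),
    Tone("ngã", "high", True),
    Tone("huyền", "low", False),
    Tone("nặng", "low", True),
    Tone("hỏi", "low", False),
)


def tones_where(**features):
    """Names of the tones whose fields match all given feature values."""
    return [
        t.name
        for t in TONES
        if all(getattr(t, key) == value for key, value in features.items())
    ]


print(tones_where(register="high"))
print(tones_where(glottalized=True))
print(tones_where(register="low", glottalized=True))
```

Contour, the third distinctive feature, is deliberately left out of the structure because it is a shape over time rather than a binary value.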
4.1.4 Tones in context
In continuous speech, tones seldom reach their target values. They are generally affected by context: stressed vs. unstressed syllable, influence of neighbouring tones, tempo, and so on. These influences have rarely been studied. Tonal variation due to the influence of neighbouring tones is described by linguists as a type of tonal coarticulation. Đỗ Thế Dũng [8] observed that after a rising tone such as “sắc” or “ngã”, any immediately following tone will start one or two quarter tones higher than its normal target value, and after a falling tone such as “nặng” or “huyền” it will start one or two quarter tones lower. This variation is stronger in unstressed positions than in stressed ones; even so, a relative difference in register and contour is preserved.
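To give a feel for the size of a shift of "one or two quarter tones", the sketch below converts it to hertz. It assumes that a quarter tone is half an equal-tempered semitone, i.e. 1/24 of an octave; the source does not define the unit. The 200 Hz target value and the function name are hypothetical.

```python
QUARTER_TONES_PER_OCTAVE = 24  # assumption: quarter tone = half a semitone


def shift_by_quarter_tones(f0_hz, quarter_tones):
    """F0 after moving up (positive) or down (negative) by quarter tones."""
    return f0_hz * 2.0 ** (quarter_tones / QUARTER_TONES_PER_OCTAVE)


target = 200.0  # hypothetical normal onset value in Hz
for steps in (-2, -1, 1, 2):
    print(steps, round(shift_by_quarter_tones(target, steps), 1))
```

Because the scale is logarithmic, the same number of quarter tones corresponds to a larger difference in hertz for a high-pitched voice than for a low-pitched one.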
4.1.5 Modality, attitude and morphosyntactic structures
In Vietnamese there are two possible ways of expressing modality, mood or attitude: the first uses only prosodic features, and the second uses lexico-syntactic markers, possibly combined with prosodic features [8]. In the first case, as the pragmatic information relies entirely on prosodic structure, it has to be clearly marked. In the second case, as intonation becomes redundant, it is interesting to see whether it can still play a role in characterizing the pragmatic type.
The Vietnamese language has a system of syntactic markers which occur mostly at the end (occasionally at the beginning or in the middle) of a declarative sentence. They are used to express modal and attitudinal meanings. For example, from the declarative sentence
Trời mưa
we may obtain a yes-no question by adding “không”:
Trời mưa không?
With another morpheme “à”, we obtain a question expressing the speaker’s surprise:
Trời mưa à?
The morphosyntactic elements can be put into three classes according to their semantic values: question, imperative and attitudinal markers.
4.1.5.1 Question markers
It has been considered that only questions with morphosyntactic markers express simple interrogative modality in Vietnamese, and that questions with only prosodic markers are always interrogatives expressing surprise or astonishment and cannot be considered a “neutral” interrogative type [8]. Some controversies remain about the classification of interrogative markers. It seems reasonable, however, to distinguish two types of question.
Yes-no questions use the following markers: “không” expresses a question on the predicative relation itself, for instance “Trời mưa không?”; “chưa” has an aspectual value, for example “Trời mưa chưa?”; “hay” gives an explicit alternative choice, for example “Trời mưa hay trời nắng?”.
Open questions use indefinite words in the same way as wh-markers: ai (who), bao giờ (when), bao lâu (how long), bao nhiêu (how many), bao xa (how far), đâu (where), gì (what), mấy (how many), mấy giờ (at what time), nào (which or what), như thế nào (how), sao/tại sao/vì sao (why), sao/làm sao (how).
Some linguists have also mentioned a third type of question, called biased questions (suggesting an expected answer), which are associated with the expression of an attitude. They are syntactically marked with the final morphemes “à, ư” (surprise), “chứ” (logical evidence), “hả, hử, hở” (insistence and astonishment), “nhé” (supposition, suggestion).
4.1.5.2 Injunctive markers
Injunction is expressed by the presence of “đi” at the end of a declarative structure, for instance “Trời mưa đi!”.
A weaker injunction is expressed with “nhé” and a stronger (insisting) one is expressed with the compound marker “hãy…đi”.
4.1.5.3 Attitude and emotional markers
In Vietnamese, a final marker can be used to express speaker attitude. Lê Thị Xuyến gave the following list: “ạ” (respect), “đấy” (admiration), “rồi” (conclusive), “mà” (insistence), “sao” (surprise), “chăng” (doubt), “hả” (anger), “nhỉ” (familiarity), “vậy” (external obligation) [8].
4.2 Some studies on Vietnamese prosody
In a tonal language like Vietnamese, which has six lexical tones and moreover a system of morphosyntactic markers to express emotions, attitudes, mood and modality, it would not be surprising if intonation played a lesser role than in non-tonal languages such as French or English: what is usually conveyed by intonation in many other languages is already marked. This idea was developed by Gordina and Bystrov: “the more a language uses morphosyntactic or syntactic means to express mood, modality and emotions, the less it would rely on intonation for the same functions” [8].
This explains why there are very few studies on intonation in Vietnamese. There are a few remarks in general grammar books. The statements about intonation made by grammarians or linguists are rather intuitive and not based on experimental description. For example, declarative sentences are said to be “falling”, with such descriptive terms as “fading” or “decreasing” (Thompson), “falling” (Nguyễn Đăng Liêm), “normal” or “low pitch” (Jones and Huỳnh Sanh Thông); whereas interrogative sentences are said to be “rising” (Nguyễn Đăng Liêm), “sustaining” (Thompson), “higher pitch level 1” (Jones and Huỳnh Sanh Thông)… Expressive sentences, on the other hand, are said to have a rising contour with a higher pitch level: “higher pitch level 2 or 3” (Jones and Huỳnh Sanh Thông), “increasing” (Thompson), “rising-falling” (Nguyễn Đăng Liêm) [8].
There are a small number of experimental studies by Gordina and
given some ideas of the role and function of intonation in Vietnamese.
According to Gordina and Bystrov, the shorter the sentence, the greater the difference between the intonation patterns [8]. In their examples:
(a) Anh ấy đi sang nước Anh à?
(b) Anh ấy đi sang nước Anh
(c) Không sách à?
(d) Không sách
the difference is greater between the (c) and (d) patterns than between (a) and (b), though in each case a declarative is contrasted with an interrogative.
According to these same authors, an interrogative without a morphosyntactic marker has a well-differentiated pattern when compared to an interrogative with a marker. In:
‘Mưa.’
‘Trời mưa.’
‘Cô ta xinh.’
‘Khuya rồi.’
‘Nam về lúc khoảng một rưỡi.’
Each sentence was read with different attitudes by 2 speakers (one male, one female) and judged by 20 hearers. The results of her experiments showed that only irony, anger and statement were identified above chance level (75%, 52.5% and 67.5% respectively). According to her, the neutral declarative is characterized by a low register and a moderate tempo; irony has a higher register, a larger tone movement and a slower tempo, resulting in increased sentence length; whereas anger is conveyed by a speeding up of tempo, greater and more abrupt pitch movement, shortening of the utterance and an increase in the overall intensity.
In order to bring out the pertinent prosodic features corresponding to assertive, interrogative and imperative modes, while excluding attitudinal variations, and to produce natural Vietnamese utterances, Nguyễn and Boulakia [9] used a number of utterance pairs in which the final question or imperative marker can be replaced by a homonymous lexical item. The resulting pairs have the same syllabic and tonal structures but different morphosyntactic structures. They are therefore considered to be ambiguous, and if they are discriminated, it has to be due to the presence of prosodic differences. Some example pairs of sentences in their corpus are as follows:
Statement – Question:
Lan thích ăn cơm không (Lan only likes to eat rice.)
Lan thích ăn cơm không? (Does Lan like to eat rice?)
Statement – Imperative:
Bảo cố gắng tập đi (Bảo is making an effort to practice walking.)
Bảo cố gắng tập đi! (Bảo, make an effort to practice!)
Question – Imperative:
Tân bỏ đi chứ? (Tân, did you leave?)
Tân bỏ đi chứ! (Tân, do leave it!)
From the five morphemes “không”, “hả”, “sao”, “chứ” and “đi”, nine such pairs of sentences were formulated. These 18 sentences were read by 4 speakers (2 males and 2 females) and judged by 22 hearers. The results of the prosodic analysis showed that questions are shorter than statements and that this difference is significant. Imperatives are even shorter, but the difference with questions is not significant. In terms of intensity, the difference is significant for the statement-imperative pair, but not for the statement-question and question-imperative ones. As for intonation, the two members of the same pair have the same overall F0 contour but there is a difference in terms of register: the register of questions and imperatives is clearly higher than that of statements, while there is no difference between questions and imperatives (Figure 4.2). There is an obvious difference in the last syllable: the “ngang” tone falls in statements and is much higher and rising in questions, while its mean value and movement are halfway between the two for imperatives. The rising tones, “sắc” or “hỏi”, rise even more in questions than in statements, while they tend to become flat or even fall slightly in the final part in imperatives. This means that intonation influences the tone of the sentence-final syllable.
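The distinction between a shared overall contour and a different register can be illustrated numerically. The following is a minimal sketch, not the analysis procedure of [9]: the F0 values are invented for illustration, the 100 Hz reference is arbitrary, and the function names are hypothetical. Register is taken here as the mean F0 in semitones, and contour similarity as the Pearson correlation between the two semitone tracks.

```python
import math
from statistics import mean


def to_semitones(f0_hz, ref_hz=100.0):
    """Convert F0 values in Hz to semitones relative to ref_hz."""
    return [12.0 * math.log2(f / ref_hz) for f in f0_hz]


def register_difference(f0_a, f0_b):
    """Mean F0 of track b minus mean F0 of track a, in semitones."""
    return mean(to_semitones(f0_b)) - mean(to_semitones(f0_a))


def contour_correlation(f0_a, f0_b):
    """Pearson correlation between two equal-length semitone tracks."""
    a, b = to_semitones(f0_a), to_semitones(f0_b)
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


# Invented F0 samples in Hz, one value per syllable; not measured data.
statement = [180.0, 170.0, 175.0, 165.0, 150.0]
# Same shape as the statement, transposed two semitones upward.
question = [f * 2.0 ** (2.0 / 12.0) for f in statement]

print(round(register_difference(statement, question), 2))
print(round(contour_correlation(statement, question), 2))
```

In this constructed case the two tracks differ only in register. A real question contour would additionally diverge on the final syllable, as described above, which would lower the correlation.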