THESIS FOR THE DEGREE OF MASTER OF SCIENCE
CHARACTERIZATION OF VIETNAMESE
INTONATION FOR QUESTIONS
NINH KHANH DUY
Supervisor: Dr ERIC CASTELLI
HANOI 2005
Acknowledgments
Firstly, I would like to express my gratitude to my supervisor, Dr Eric Castelli, whose expertise, understanding, patience, and constructively critical eye added considerably to my graduate experience.
Special thanks go to Ir Nguyen Trong Giang and Dr Pham Thi Ngoc Yen for providing me with the best possible working conditions during my time at the International Research Center MICA.
I would like to thank the PhD students Nguyen Viet Tung, Tran Do Dat, Vu Minh Quang and Le Xuan Hung, who helped me a lot in finishing the thesis.
I would also like to thank my family, especially my parents, for the support they have provided me throughout my entire life; without their care and encouragement I would not have finished this thesis.
Finally, thanks go to all of my colleagues who helped me while I worked on this thesis.
3.1 The concepts of prosody and intonation
3.2 Levels of representation of prosodic phenomena
3.4 Applications of intonation
Chapter 4 PROSODY IN VIETNAMESE
4.1 General characteristics of Vietnamese language
4.1.5 Modality, attitude and morphosyntactic structures
Chapter 5 FUNDAMENTAL FREQUENCY DETECTION
5.1 Introduction
5.2 Some pitch detection algorithms
5.2.1 The autocorrelation method
5.2.2 The average magnitude difference function method
5.2.3 The simple inverse filtering tracking method
5.2.4 The cepstrum-based method
5.3 The Praat pitch tracker
A List of questions in the corpus
B List of statements in the corpus
Figure 2.1 The underlying determinants of speech generation and understanding. The gray boxes indicate the corresponding computer system components for spoken language processing [1]
Figure 2.2 Application of sound energy causes alternating compression/rarefaction of air molecules, described by a sine wave [1]
Figure 2.3 A schematic diagram of the human speech production apparatus
Figure 2.4 Schematic representation of the complete physiological mechanism of speech production [2]
Figure 2.5 A section of waveform of the utterance “sa”. The unvoiced sound “s” in the first part and the voiced sound “a” in the second part
Figure 2.6 Vocal fold cycling at the larynx: (a) closed with sub-glottal pressure buildup; (b) trans-glottal pressure differential causing folds to blow apart; (c) pressure equalization and tissue elasticity forcing temporary reclosure of vocal folds, ready to begin next cycle [1]
Figure 2.7 Glottal airflow and the resulting sound pressure at the mouth [2]
Figure 4.1 Example of the contours of six tones (female subject PNY), as described in [7]
Figure 4.2 F0 variations of 2 typical pairs of sentences in [9]
Figure 5.1 Autocorrelation function for (a) and (b) voiced speech, and (c)
Figure 5.2 Example of waveforms and correlation function: (a) no clipping,
Figure 5.3 AMDF function for same speech segments as in Figure 5.1 [10]
Figure 5.4 Block diagram of the SIFT algorithm [10]
Figure 5.5 Cepstrum of an example segment of: (a) voiced speech, (b)
Figure 5.6 Windowing a signal and estimating the ACF of a signal segment from the ACF of its windowed version [15]
Figure 5.7 Some F0 points are detected in the unvoiced consonant “kh” of the
Figure 5.8 Pitch halving errors in the middle of the word “tra” (female subject
Figure 5.9 Some F0 points are missed in the voiced consonant “b” of the word
Figure 5.10 Some F0 points are missed in the middle of the word “rd” (female
Figure 5.11 Some F0 points are missed at the end of the word “vay” (male
Figure 6.1 Speech waveform (in the background) and F0 contour (blue dotted line) of the utterance “Bây giờ anh ở đâu?” (male subject ND). The final syllable “đâu” is bounded by two vertical lines
Figure 6.2 F0 contour (blue dotted line) and proposed intonation contour (red dotted line) of the utterance “Hiện tại anh làm việc ở đâu?” (male subject
Figure 6.3 The intonation contour (red dotted line) of the statement “Ba ay
Figure 6.5 F0 level of all speakers for questions (Q) and statements (S)
Figure 6.6 Time waveform (top), F0 contour (middle) and the position of 5
Table 4.3 Arrangement of Vietnamese consonants
Table 4.4 The phonological hierarchy of Vietnamese syllables with total
Table 5.1 Praat PDA evaluation for male speech and female speech
Table 6.2 Statistics on F0 level of all speakers for questions (Q) and statements (S) including: mean, minimum (min), maximum (max) and
Table 6.3 Representative values of “ngang” tone in final position of questions and statements for all speakers
Vocal technologies are important and strategic in the development of information technology. The increasing demand for the application of speech in man-machine communication in all areas, ranging from telephony, telematics, and automated translation to aids for the handicapped, requires sophisticated technology for the recognition and synthesis of speech. However, to build automatic speech synthesis or speech recognition modules for a given language, it is essential to know the characteristics of the language thoroughly, particularly in terms of phonetics and phonology.
Since intonation forms such a central part of human speech communication, conveying not only diverse linguistic information but also information about the speaker and the speaker’s mood and attitude, it certainly ought to be useful in the applications above. In the field of speech recognition, the more the task develops from the recognition of single words in a limited vocabulary towards the understanding of complex utterances, the more suprasegmental features like intonation have to be taken into account. These are important cues for the segmentation and classification (question vs. statement, for instance) of utterances. In speech synthesis, modeling intonational features is indispensable for increasing the intelligibility and naturalness of synthetic speech. This is the reason I chose to study the characteristics of Vietnamese intonation in questions.
The thesis is organized as follows. Chapter 2 gives a brief review of the human speech production system and an introduction to some related fundamental concepts. An overview of prosody, which includes intonation, is presented in Chapter 3. Chapter 4 describes the general characteristics of the Vietnamese language and some studies on Vietnamese prosody. Fundamental frequency, the acoustical correlate of intonation, and the problem of its estimation are covered in Chapter 5. Chapter 6 presents the experiments carried out in this work and the results obtained. Finally, the conclusion and the perspectives of the study are given in Chapter 7.
2.1 Introduction
As we will see, intonation is based on the vibration of the vocal folds, which is an inherent characteristic of the speech production process; in other words, once there is speech there is normally intonation too. An understanding of the speech production process is therefore necessary for understanding how intonation is formed. This chapter gives a brief review of the human speech production system and introduces some fundamental concepts used in the thesis.
Spoken language is used to communicate information from a speaker to a listener. Speech production and perception are both important components of the speech chain. Speech begins with a thought and intent to communicate in the brain, which activates muscular movements to produce speech sounds. A listener receives it in the auditory system, processing it for conversion to neurological signals the brain can understand. The speaker continuously monitors and controls the vocal organs by receiving his or her own speech as feedback. Considering the universal components of speech communication as shown in Figure 2.1, the fabric of spoken interaction is woven from many distinct elements. The speech production process starts with the semantic message in a person’s mind to be transmitted to the listener via speech. The computer counterpart to the process of message formulation is the application semantics that creates the concept to be expressed. After the message is created, the next step is to convert the message into a sequence of words. Each word consists of a sequence of phonemes that corresponds to the pronunciation of the word. Each sentence also contains a prosodic pattern that denotes the duration of each phoneme, the intonation of the sentence, and the loudness of the sounds. Once the language system finishes the mapping, the talker executes a series of neuromuscular signals. The neuromuscular commands perform articulatory mapping to control the vocal cords, lips, jaw, tongue, and velum, thereby producing the sound sequence as the final output. The speech understanding process works in reverse order. First the signal is passed to the cochlea in the inner ear, which performs frequency analysis as a filter bank. A neural transduction process follows and converts the spectral signal into activity signals on the auditory nerve, corresponding roughly to a feature extraction component. Currently, it is unclear how neural activity is mapped into the language system and how message comprehension is achieved in the brain.
[Figure 2.1: block diagram with two panels, “Speech Generation” and “Speech Understanding”]
Figure 2.1 The underlying determinants of speech generation and understanding. The gray boxes indicate the corresponding computer system components for spoken language processing [1]
2.2 Sound
Sound is a longitudinal pressure wave formed of compressions and rarefactions of air molecules, in a direction parallel to that of the application of energy. Compressions are zones where air molecules have been forced by the application of energy into a tighter-than-usual configuration, and rarefactions are zones where air molecules are less tightly packed. The alternating configurations of compression and rarefaction of air molecules along the path of an energy source are sometimes described by the graph of a sine wave, as shown in Figure 2.2. In this representation, crests of the sine curve correspond to moments of maximal compression and troughs to moments of maximal rarefaction.
Figure 2.2 Application of sound energy causes alternating compression/rarefaction of air molecules, described by a sine wave [1]
The use of the sine graph in Figure 2.2 is only a notational convenience for charting local pressure variations over time, since sound does not form a transverse wave, and the air particles are just oscillating in place along the line of application of energy. The amount of work done to generate the energy that sets the air molecules in motion is reflected in the amount of displacement of the molecules from their resting position. This degree of displacement is measured as the amplitude of a sound, as shown in Figure 2.2.
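The sine-wave description of local pressure variation can be illustrated with a minimal sketch. The function name `pressure_wave`, the 8000 Hz sampling rate, and the 100 Hz tone with amplitude 0.5 are all illustrative assumptions, not values taken from this thesis.

```python
import math

def pressure_wave(amplitude, frequency_hz, duration_s, sample_rate_hz=8000):
    """Sample p(t) = A * sin(2*pi*f*t), the local pressure deviation over time."""
    n_samples = round(duration_s * sample_rate_hz)
    return [
        amplitude * math.sin(2 * math.pi * frequency_hz * n / sample_rate_hz)
        for n in range(n_samples)
    ]

# Hypothetical example: a 100 Hz tone with amplitude 0.5, observed for 20 ms.
wave = pressure_wave(amplitude=0.5, frequency_hz=100.0, duration_s=0.02)

peak = max(abs(p) for p in wave)      # amplitude: largest deviation from rest
crest_index = wave.index(max(wave))   # a moment of maximal compression
trough_index = wave.index(min(wave))  # a moment of maximal rarefaction
period_samples = 8000 // 100          # one cycle lasts 80 samples here
```

In this sketch the crest falls a quarter of a period after the start and the trough three quarters of a period after it, which mirrors the crest/trough reading of Figure 2.2.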
2.3 Speech production
2.3.1 Articulators
Speech is produced by air-pressure waves emanating from the mouth and the nostrils of a speaker. In most of the world’s languages, the inventory of phonemes can be split into two basic classes:
* consonants - articulated in the presence of constrictions in the throat or obstructions in the mouth (tongue, teeth, lips) as we speak
* vowels - articulated without major constrictions and obstructions
[Figure 2.3: diagram of the vocal tract; legible labels include “Velum” and “Vocal Cords”]
Figure 2.3 A schematic diagram of the human speech production apparatus
The sounds can be further partitioned into subgroups based on certain articulatory properties. These properties derive from the anatomy of a handful of important articulators and the places where they touch the boundaries of the human vocal tract. Additionally, a large number of muscles contribute to articulator positioning and motion. A schematic view of only the major articulators is diagrammed in Figure 2.3. The gross components of the speech production apparatus are the lungs, trachea, larynx (organ of voice production), pharyngeal cavity (throat), and oral and nasal cavities. The pharyngeal and oral cavities are typically referred to as the vocal tract, and the nasal cavity as the nasal tract. As illustrated in Figure 2.3, the human speech production apparatus consists of:
* Lungs: source of air during speech
* Vocal cords: when the vocal cords are held close together and oscillate against one another during a speech sound, the sound is said to be voiced. When the cords are too slack or tense to vibrate periodically, the sound is said to be unvoiced. The place where the vocal cords come together is called the glottis
* Velum (soft palate): operates as a valve, opening to allow passage of air through the nasal cavity. Sounds produced with the flap open include m and n
* Hard palate: a long, relatively hard surface at the roof inside the mouth, which, when the tongue is placed against it, enables consonant articulation
* Tongue: flexible articulator, shaped away from the palate for vowels, placed close to or on the palate or other hard surfaces for consonant articulation
* Teeth: another place of articulation used to brace the tongue for certain consonants
* Lips: can be rounded or spread to affect vowel quality, and closed completely to stop the oral air flow in certain consonants (p, b, m)
[Figure 2.4: schematic of the speech production mechanism; legible label: “Lung volume”]
Figure 2.4 Schematic representation of the complete physiological mechanism of speech production [2]
A simplified representation of the complete physiological mechanism for creating speech is shown in Figure 2.4. Air enters the lungs via the normal breathing mechanism. As air is expelled from the lungs through the trachea (or windpipe), the tensed vocal cords within the larynx are caused to vibrate (in the mode of a relaxation oscillator) by the air flow. The air flow is chopped into quasi-periodic pulses, which are then modulated in frequency in passing through the throat (pharyngeal cavity), the mouth cavity, and possibly the nasal cavity. Depending on the positions of the various articulators, different sounds are produced.
2.3.2 The voicing mechanism
The most fundamental distinction between sound types in speech is the voiced/unvoiced distinction. Voiced sounds, including vowels, have in their time and frequency structure a roughly regular pattern that voiceless sounds,
Figure 2.5 A section of waveform of the utterance “sa”. The unvoiced sound “s” in the first part and the voiced sound “a” in the second part
What in the speech production mechanism creates this fundamental distinction? When the vocal folds vibrate during phoneme articulation, the phoneme is considered voiced; otherwise it is unvoiced. Voiced sounds are produced by forcing air through the glottis with the tension of the vocal cords adjusted so that they vibrate in a relaxation oscillation, thereby producing quasi-periodic pulses of air which excite the vocal tract. The resulting speech waveform is therefore quasi-periodic. Unvoiced sounds are generated by forming a constriction at some point in the vocal tract (usually toward the mouth end) and forcing air through the constriction at a high enough velocity to produce turbulence. This creates a broad-spectrum noise source to excite the vocal tract, so the resulting speech waveform is aperiodic or random in nature. The vocal folds vibrate at slower or faster rates, from as low as 60 cycles per second (Hz) for a large man to as high as 300 Hz or higher for a small woman or child. The rate of cycling (opening and closing) of the vocal folds in the larynx during phonation of voiced sounds is called the fundamental frequency, or F0, because it sets the periodic baseline for all higher-frequency harmonics contributed by the pharyngeal and oral resonance cavities above. With an appropriate approximation, F0 correlates well with the subjective experience of pitch (the rising and falling of voice
Figure 2.6 Vocal fold cycling at the larynx: (a) closed with sub-glottal pressure buildup; (b) trans-glottal pressure differential causing folds to blow apart; (c) pressure equalization and tissue elasticity forcing temporary reclosure of vocal folds, ready to begin next cycle [1]
The glottal cycle is illustrated in Figure 2.6. At stage (a), the vocal folds are closed and the air stream from the lungs is indicated by the arrow. At some point, the air pressure on the underside of the barrier formed by the vocal folds increases until it overcomes the resistance of the vocal fold closure and the higher air pressure below blows them apart (b). However, the tissues and muscles of the larynx and the vocal folds have a natural elasticity which tends to make them fall back into place rapidly once air pressure is temporarily equalized (c). The successive airbursts resulting from this process are the source of energy for all voiced sounds. The time for a single open-close cycle depends on the stiffness and size of the vocal folds and the amount of sub-glottal air pressure. These factors can be controlled by a speaker to raise and lower the perceived frequency or pitch of a voiced sound.
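The link between glottal cycling, periodicity and F0 can be made concrete with a small sketch. It searches the 60 to 300 Hz range quoted above for the lag with the strongest normalized autocorrelation, and treats a strong peak as a sign of voicing. The function names, the synthetic test signals, the 8000 Hz sampling rate and the 0.5 voicing threshold are all assumptions made for illustration; this is not the pitch tracker used in this thesis.

```python
import math
import random

def autocorr_f0(frame, sample_rate_hz, f0_min_hz=60.0, f0_max_hz=300.0):
    """Return (f0_hz, peak): the F0 candidate with the highest normalized
    autocorrelation inside the plausible pitch range, and that peak value."""
    energy = sum(x * x for x in frame)
    if energy == 0.0:
        return 0.0, 0.0
    lag_min = int(sample_rate_hz / f0_max_hz)   # shortest period considered
    lag_max = int(sample_rate_hz / f0_min_hz)   # longest period considered
    best_lag, best_peak = lag_min, -1.0
    for lag in range(lag_min, lag_max + 1):
        r = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
        if r / energy > best_peak:
            best_lag, best_peak = lag, r / energy
    return sample_rate_hz / best_lag, best_peak

def harmonic_frame(f0_hz, sample_rate_hz, num_samples, num_harmonics=5):
    """Synthetic quasi-periodic signal: harmonics of f0 with 1/k amplitudes."""
    return [
        sum(math.sin(2 * math.pi * k * f0_hz * n / sample_rate_hz) / k
            for k in range(1, num_harmonics + 1))
        for n in range(num_samples)
    ]

SAMPLE_RATE_HZ = 8000
VOICING_THRESHOLD = 0.5  # hypothetical decision threshold

voiced = harmonic_frame(125.0, SAMPLE_RATE_HZ, 512)   # stands in for a vowel
rng = random.Random(0)
unvoiced = [rng.gauss(0.0, 1.0) for _ in range(512)]  # stands in for noise like "s"

voiced_f0, voiced_peak = autocorr_f0(voiced, SAMPLE_RATE_HZ)
_, unvoiced_peak = autocorr_f0(unvoiced, SAMPLE_RATE_HZ)
voiced_is_voiced = voiced_peak >= VOICING_THRESHOLD
unvoiced_is_voiced = unvoiced_peak >= VOICING_THRESHOLD
```

The periodic frame yields a clear peak at a lag of one period, while the noise frame has no comparable peak, which is the waveform-level counterpart of the voiced/unvoiced distinction described in this section.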
The glottal air flow (volume velocity waveform) and the resulting sound pressure at the mouth for a typical vowel sound are shown in Figure 2.7. The glottal waveform shows a gradual build-up to a quasi-periodic pulse train of air, taking about 15 ms to reach steady state. This build-up is also reflected in the acoustic waveform shown at the bottom of the figure.
Figure 2.7 Glottal airflow and the resulting sound pressure at the mouth [2].
Chapter 3 AN OVERVIEW OF PROSODY
3.1 The concepts of prosody and intonation
The term prosody (ngôn điệu) refers to certain properties of the speech signal such as audible changes in pitch, loudness, and syllable length [3]. For some authors the set of prosodic features also includes other aspects related to speech timing, such as rhythm and speech rate.
Because prosodic events appear to be time-aligned with syllables or groups of syllables rather than with segments (phonemes), they are also referred to as suprasegmental phenomena.
The term intonation (ngữ điệu) is used by some as a synonym for prosody. Others restrict it to the tonal (melodic) aspects of prosody [3]; this will be the case here, too. In this thesis, intonation therefore refers to pitch variation in speech production and is part of prosody.
3.2 Levels of representation of prosodic phenomena
As for other properties of the speech signal, prosodic events can be studied at various levels of representation (see Table 3.1):
* First, the acoustic level: the acoustic manifestation of prosody (fundamental frequency, amplitude, and duration) can be measured directly, using specialized hardware or algorithms (such as pitch determination algorithms).
* Second, the perceptual level represents the prosodic events as heard by the listener. As for spectral properties of speech sounds, acoustic characteristics that can be measured are not always perceptible. The perceptual representation is accessible to the individual listener, but this mental representation can hardly be measured. Alternatively, it can be computed with a fair amount of precision on the basis of our knowledge about psychoacoustics.
* Finally, the linguistic level represents the prosody of an utterance as a sequence of abstract units (signs, symbols), some of which have a communicative function in speech, while others may just fulfill syntactic requirements. The linguistic structure of prosody is not some hidden code that can simply be revealed using some standard procedure.
Table 3.1 Links between levels of representation of prosodic phenomena [3]
Acoustic | Perceptual | Linguistic
Fundamental frequency (F0) | Pitch | Tone, intonation, aspect of stress
As one moves away from the acoustic level towards the perceptual and/or linguistic levels, the measurement of a given prosodic property will progressively involve segmentation (for example, into syllables), context (such as relative prominence), and structural information (the linguistic interpretation of a syllabic tone, for example, often depends on whether the related syllable is stressed or not, which requires a prior analysis of the segmental layer).
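Of the three acoustic-level quantities named in this section, amplitude and duration are the simplest to measure directly. The sketch below shows one possible way to do so with frame-wise root-mean-square (RMS) amplitude; the function names, the frame and hop sizes, and the synthetic 100 Hz test tone are illustrative assumptions rather than the procedure used in this work.

```python
import math

def frame_rms(samples, frame_len, hop_len):
    """RMS amplitude of successive frames, a rough acoustic correlate of loudness."""
    values = []
    for start in range(0, len(samples) - frame_len + 1, hop_len):
        frame = samples[start:start + frame_len]
        values.append(math.sqrt(sum(x * x for x in frame) / frame_len))
    return values

def duration_seconds(num_samples, sample_rate_hz):
    """Duration of a segment given its length in samples."""
    return num_samples / sample_rate_hz

# Hypothetical segment: 0.1 s of a unit-amplitude 100 Hz tone sampled at 8000 Hz.
SAMPLE_RATE_HZ = 8000
segment = [math.sin(2 * math.pi * 100.0 * n / SAMPLE_RATE_HZ) for n in range(800)]

rms_track = frame_rms(segment, frame_len=160, hop_len=80)  # 20 ms frames, 10 ms hop
segment_duration = duration_seconds(len(segment), SAMPLE_RATE_HZ)
```

Moving from such raw measurements to the perceptual or linguistic level would additionally require segmentation and context, as the paragraph above points out.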
3.3 The functions of prosody
The functions of prosody can be divided into those which modify meaning and those which do not (see Table 3.2). The former could also be seen as the part of the information which is consciously and intentionally provided by the speaker, the message, whereas the latter involuntarily accompanies it.
Table 3.2 Information conveyed by prosody, ‘*’ marking feature discussed in this study
focus intention, attitude sex
(native language, dialect)
Prosodic features have specific functions in speech communication. One of the most uncontroversial functions of intonation is that of conveying different illocutionary aspects, or sentence modes. Thus it is commonly maintained that a distinction between declarative and interrogative modes is one of the most universal characteristics of intonation systems. My work is centered on the contribution of prosody to the expression of the interrogative mode.
One of the most apparent effects of prosody is that of focus. For instance, certain pitch events make a syllable stand out within the utterance, and indirectly the word or syntactic group it belongs to will be highlighted as an important or new component in the meaning of that utterance. The presence of a focus marking may have various effects, such as contrast, depending on the place where it occurs or the semantic context of the utterance.
Prosodic features create a segmentation of the speech chain into groups of syllables, or, put the other way round, they give rise to the grouping of syllables and words into larger chunks.
All these aspects of intonation can be grouped under the header of linguistic aspects of intonation. They are part of the structure of language (and specific to any given language) in the same way as morphology and syntax are. The linguistic features concern the way a message is formally coded and organized into intonational units of a certain language. They correspond to the surface structure of the message on a still rather abstract level. The actual meaning of the message often cannot be decoded without interpreting the underlying paralinguistic information. Paralinguistic information is defined as the information that is not inferable from the written counterpart but is deliberately added by the speaker to modify or supplement the linguistic information. A written sentence can be uttered in various ways to express different intentions and attitudes which are under the conscious control of the speaker. The question “Are you tired?”, for instance, is simply a request for information on someone's psychological and physiological condition. If it is asked with a concerned undertone, then the message may be “Come on, you’ve been working so hard, you have to get yourself some sleep!” With an ironical undertone, it may mean “You lazy guy, you’ve been sleeping all day and still you're tired!”
There is, however, another range of phenomena that are also expressed by prosodic means (such as pitch) but do not modify the meaning of a message. They can convey information about the age, gender, and emotional or physical state of the speaker. These factors are not directly related to the linguistic and paralinguistic contents of the utterances and cannot generally be controlled by the speaker. Angry people, for instance, usually have faster pitch changes, a larger pitch range, and a larger dynamic amplitude range, whereas depressed people typically show the opposite trend. But while the pitch range may be affected by such emotional factors, the basic functional pitch shapes and configurations remain unaffected. The emotional state does not alter the linguistic code; it merely affects its realization. This is why these aspects are called non-linguistic aspects.
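The notion of pitch range used above can be quantified in several ways; one common choice is the interval between the lowest and highest F0 values expressed in semitones, 12 · log2(F0max / F0min). The sketch below assumes that convention. The function name, the use of 0 to mark unvoiced frames, and both F0 contours are made-up illustrations, not measurements from the corpus of this thesis.

```python
import math

def pitch_range_semitones(f0_values_hz):
    """Pitch range of a contour in semitones: 12 * log2(max F0 / min F0).
    Frames with F0 <= 0 are treated as unvoiced and ignored (assumed convention)."""
    voiced = [f for f in f0_values_hz if f > 0]
    if not voiced:
        return 0.0
    return 12.0 * math.log2(max(voiced) / min(voiced))

# Two hypothetical F0 contours in Hz (0 marks an unvoiced frame).
wide_contour = [0, 100, 120, 150, 200, 0]
narrow_contour = [110, 115, 120, 0, 112]

wide_range = pitch_range_semitones(wide_contour)      # one octave
narrow_range = pitch_range_semitones(narrow_contour)
```

A semitone scale is used here because it compares speakers with different F0 levels on a relative basis; whether it is the most suitable scale for a given study is a separate question.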
The understanding of information conveyed by intonation is important for intonation study. Each type of information has its effect on tonal variations, i.e. intonation. These effects need to be taken into account in intonation analysis.
3.4 Applications of intonation
Since intonation forms such a central part of human speech communication, conveying not only diverse linguistic information but also information about the speaker and the speaker’s mood and attitude, it certainly ought to be useful in many applications. Apart from language technology and speech synthesis, where intonation is an established application, diverse areas of medical as well as educational applications where intonation is less commonplace are being developed.
Speech processing:
The increasing demand for the application of speech in man-machine communication in all areas, ranging from telephony, telematics, and automated translation to aids for the handicapped, requires sophisticated technology for the analysis, recognition and synthesis of speech. In the field of speech recognition, the more the task develops from the recognition of single words in a limited vocabulary towards the understanding of complex utterances, the more suprasegmental features like intonation have to be taken into account. These are important cues for the segmentation and classification (question vs. declaration, for instance) of utterances. In this context, modeling intonation is an important task.
In speech synthesis, modeling intonational features is indispensable for increasing the intelligibility and naturalness of synthetic speech. Sophisticated models of intonation are needed which predict F0 contours for a particular sentence at a given speech rate.
Automatic language identification could be important, especially in telecom applications where the spectral content of the speech can be expected to be distorted. Intonation cues are especially interesting in this case. The varied intonational structure of languages could be exploited in this application. In this recognition task, the intonation cues need to be combined with other types of information.
Speech Pathology:
Hearing impairments, especially if they are congenital or acquired at an early age, are accompanied by a reduced intelligibility of speech. Major factors for this are an imperfect command of phonatory effort and a lack of control of the laryngeal function, which result in a high degree of variation in the pitch patterns produced. The speech of hearing impaired people may sound monotonous or, on the contrary, excessively emotional. The basic pitch is often kept on a level either too high or too low, and it was observed that hearing impaired persons have difficulties in changing their pitch within a single syllable. Teaching aids have been developed to overcome these problems, which provide feedback for pitch over tactile or visual channels.
Foreign Language Education:
It is widely agreed that the acquisition of a good command of intonational features in a foreign language is one of the most difficult tasks a student must accomplish. Yet it is crucial for the degree of intelligibility he or she will achieve. In traditional language education, however, intonation usually comes second to segmental phonetics, which itself forms only a small part in the curriculum of common language courses. This deficit has become more apparent as political and economic globalization requires better communicative skills on the part of the learner of a foreign language. In this context, individual computer-based language education will play a growing role. Software is needed which is fit for the special problems of the speaker of a language L1 who studies a target language L2. Although a number of programs exist whereby the student can train his lexical, grammatical or orthographic skills, there are few systems which use speech input to help correct the student's pronunciation. In this context, visualization of speech can provide additional feedback where the auditory channel fails because of mother tongue interference.
It seems desirable to develop more intelligent systems which are customized to the special requirements of students with the same native language. Contrastive studies of the intonational systems of L1 and L2 can help to predict problems and select appropriate teaching materials.
Chapter 4 PROSODY IN VIETNAMESE
The understanding of the phonetic and phonological characteristics of a language has an important role in studies on speech processing in general and on intonation analysis in particular. This chapter provides a review of the characteristics of the Vietnamese language and some works of other authors related to my study.
4.1 General characteristics of Vietnamese language
Vietnamese is known as a tonal language, which uses tone to distinguish lexical meaning. Vietnamese has basically six lexical tones. Each tone can contribute to creating the morpheme and the meaning of a word, e.g. me, mé, mé, ma, mé, me. This is not the case for non-tonal languages. In English, for example, the position of the stressed syllable within a word is lexically distinctive.
4.1.1 Phoneme system
The Vietnamese phoneme system includes 14 vowels or vowel combinations and 22 consonants.
The Vietnamese vowels include eleven vowels and three diphthongs [5] (see Table 4.1). All vowels are voiced sounds.
Table 4.1 Vietnamese vowels
Phoneme | Spelling | Examples
/ie/ | ia, yê, iê, ya | kia kia, yêu kiểu
/uo/ | ua, uô | tua rua, luôn luôn
/ɯɤ/ | ưa, ươ | lưa thưa, lượt thượt
Vietnamese includes 22 consonants [5]:
Table 4.2 Vietnamese consonants
fa dé, gié and do d, gi duyén dang, gitt gin
of consonant in syllable. Based on these features, Vietnamese consonants can
[Table: consonants arranged by articulation position (legible columns: apical, dental, laminal) and articulation method]
Vietnamese grammarians and linguists have long considered the syllable in Vietnamese as a fundamental unit. A syllable in full structure (a tonal syllable) has five parts: initial sound, medial sound, nucleus sound, final sound and tone (Table 4.4) [5]. For instance, the syllable “toán” has the following components: initial sound /t/, medial sound /o/, nucleus sound /a/, final sound /n/, and tone “sắc” (or rising tone). A syllable must have a nucleus sound; the other components are optional. A nucleus sound alone can form a syllable (for instance, “a”). Apart from the initial sound (called the INITIAL part), the rest of the syllable is called the FINAL part. A tone is a fundamental frequency variation spreading over the whole syllable. A tone has the same function as a phoneme. It is always assigned to a syllable, and its influence covers the entire syllable. There are a few constraints: if a syllable ends with unvoiced
4.1.3 The tonal system
There are six syllabic tones in Vietnamese (see Table 4.5). To describe the tonal system on a physical basis, most linguists have studied tones in isolated syllables, where they are likely to be realized as closely as possible to their phonotype. In terms of distinctive features, Vietnamese tones can be described according to register, contour and glottalization (the complete or partial closure of the glottis during the articulation of another sound). These tones can be separated into two groups according to register: “ngang”, “sắc”, “ngã” are realized in a higher register while “huyền”, “nặng”, “hỏi” are realized in a lower one. Based on the glottalization feature, the six tones can be classified into two groups: “ngã” and “nặng” are glottalized whereas “ngang”, “huyền”, “sắc”, “hỏi” are non-glottalized. The F0 contours of the six Vietnamese tones (examples are shown in Figure 4.1) are described as follows [6]:
Table 4.5 The six Vietnamese tones
Tone 1 | Tone 2 | Tone 3 | Tone 4 | Tone 5 | Tone 6
Figure 4.1 Example of the contours of six tones (female subject PNY), as described in [7]
• Tone 1 - Level tone (“ngang”): a high tone. At the beginning of the syllable, it is the highest tone. The steady state of the level contour is observed consistently.
• Tone 2 - Falling tone (“huyền”): the onset of the falling tone is lower than that of tone 1, tone 5 and tone 3. The low F0 at the onset gradually falls toward the end.
• Tone 3 - Broken tone (“ngã”): the onset is as high as that of tone 5 and higher than that of the falling tone. The second third
of the contour of this tone is characterized by an abrupt dip
caused by glottalization. In most cases, the bottom of the dip
occurs between the mid-point and the point two-thirds from the
onset. A creaky voice is heard during this dip.
• Tone 4 - Curve tone (“hỏi”): the onset is the lowest among the six tones. The low onset falls further gradually until the point two-thirds from the onset. From this point, the extremely low F0 starts to rise toward the end.
• Tone 5 - Rising tone (“sắc”): the onset is also high. Starting from
the high onset, the F0 gradually rises for the first two-thirds of the
duration. After this point, the rise becomes more rapid.
• Tone 6 - Drop tone (“nặng”): the onset is usually higher than that
of the falling or curve tone but considerably lower than that of tone
1, tone 5 and tone 3. This tone is characterized by glottalization
at the end and also by its considerably shorter duration than the
other tones. The duration of this tone is approximately two-thirds
of that of the other tones. The main body of this tone is almost level
or slightly falling.
These descriptions apply only to the Northern dialect, in particular the Hanoi
dialect, which is the standard dialect of Vietnamese. They differ
in the other dialects of the South and the Center of Vietnam. In these
regions, there are only 5 tones instead of the 6 of the Hanoi dialect, because
tone 3 and tone 4 are pronounced identically.
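The register and glottalization groupings given in this section can be written down as a small lookup table. The following Python sketch is only an illustration: the class name `Tone`, its field names and the helper `group_by` are assumptions introduced here, while the feature values are the ones stated in the text for the Hanoi dialect.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tone:
    """One Hanoi-dialect tone (field names are illustrative assumptions)."""
    number: int        # tone number as used in this section
    name: str          # Vietnamese name
    label: str         # English label used in this section
    register: str      # "high" or "low"
    glottalized: bool  # True for glottalized tones


# Feature values as stated in the text above.
TONES = [
    Tone(1, "ngang", "level", "high", False),
    Tone(2, "huyền", "falling", "low", False),
    Tone(3, "ngã", "broken", "high", True),
    Tone(4, "hỏi", "curve", "low", False),
    Tone(5, "sắc", "rising", "high", False),
    Tone(6, "nặng", "drop", "low", True),
]


def group_by(tones, key):
    """Group tone names by the value returned by key(tone)."""
    groups = {}
    for tone in tones:
        groups.setdefault(key(tone), []).append(tone.name)
    return groups


by_register = group_by(TONES, lambda t: t.register)
by_glottalization = group_by(TONES, lambda t: t.glottalized)

print(by_register)
print(by_glottalization)
```

A table of this kind covers only the two binary features; the contour descriptions (onset height, dip, final rise) would need a richer representation.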
4.1.4 Tones in context
In continuous speech, tones seldom reach their target values. They are
generally affected by context: stressed vs. unstressed syllable, influence of
neighbouring tones, tempo. These influences have rarely been studied.
Tonal variation due to the influence of neighbouring tones is described by
linguists as a type of tonal coarticulation. Đỗ Thế Dũng [8] observed that after
a rising tone such as “sắc” or “ngã”, any immediately following tone will start
one or two quarter tones higher than its normal target value, and after a falling
tone such as “nặng” or “huyền” it will start one or two quarter tones lower.
This variation is stronger in unstressed positions than in stressed ones; in
spite of this, a relative difference in register and contour is preserved.
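To make the size of this effect concrete, the onset shift can be expressed in hertz. The sketch below assumes that a quarter tone is half an equal-tempered semitone, i.e. 1/24 of an octave; the function names and the default shift of one quarter tone are illustrative choices, not part of the cited observation.

```python
QUARTER_TONES_PER_OCTAVE = 24  # assumption: quarter tone = half a semitone

RISING_TONES = {"sắc", "ngã"}     # rising tones named in the text
FALLING_TONES = {"nặng", "huyền"}  # falling tones named in the text


def shift_quarter_tones(f0_hz, quarter_tones):
    """Shift a frequency up (positive) or down (negative) by quarter tones."""
    return f0_hz * 2.0 ** (quarter_tones / QUARTER_TONES_PER_OCTAVE)


def contextual_onset(target_onset_hz, preceding_tone, shift=1):
    """Onset F0 of a tone after coarticulation with the preceding tone.

    shift is the size of the effect in quarter tones (the text reports
    one or two); tones not listed above leave the target unchanged.
    """
    if preceding_tone in RISING_TONES:
        return shift_quarter_tones(target_onset_hz, shift)
    if preceding_tone in FALLING_TONES:
        return shift_quarter_tones(target_onset_hz, -shift)
    return target_onset_hz


# Hypothetical target onset of 200 Hz, two quarter tones of shift.
print(contextual_onset(200.0, "sắc", shift=2))
print(contextual_onset(200.0, "huyền", shift=2))
```

Under this assumption a shift of two quarter tones equals one semitone, so the effect is small compared with the register difference between high and low tones.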
4.1.5 Modality, attitude and morphosyntactic structures
In Vietnamese there are two possible ways of expressing modality,
mood or attitude: the first uses only prosodic features, and the second uses
lexico-syntactic markers, possibly combined with prosodic features [8]. In
the first case, as the pragmatic information relies entirely on prosodic
structure, it has to be clearly marked. In the second case, as intonation becomes redundant, it is interesting to see if it can still play a role in characterizing the
pragmatic type.
The Vietnamese language has a system of syntactic markers which
occur mostly at the end (occasionally at the beginning or in the middle) of a
declarative sentence. They are used to express modal and attitudinal
meanings. For example, from a declarative sentence
Trời mưa.
we may obtain a yes-no question by adding “không”:
Trời mưa không?
With another morpheme, “à”, we obtain a question expressing the
speaker’s surprise:
Trời mưa à?
The morphosyntactic elements can be put into three classes according
to their semantic values: question, imperative and attitudinal markers.
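The derivation of a marked question from a declarative sentence can be sketched as a simple string operation. This Python fragment is only an illustration of the two examples above: the dictionary `FINAL_MARKERS` and the function `add_final_marker` are names introduced here, and the sketch ignores markers that occur at the beginning or in the middle of the sentence.

```python
# Final markers taken from the two examples in the text; the descriptions
# are paraphrases, and the structure itself is an illustrative assumption.
FINAL_MARKERS = {
    "không": "yes-no question",
    "à": "question expressing the speaker's surprise",
}


def add_final_marker(declarative, marker):
    """Append a sentence-final question marker to a declarative sentence."""
    if marker not in FINAL_MARKERS:
        raise ValueError(f"unknown marker: {marker}")
    # Drop any final period or space before adding the marker.
    return f"{declarative.rstrip('. ')} {marker}?"


print(add_final_marker("Trời mưa.", "không"))
print(add_final_marker("Trời mưa.", "à"))
```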
surprise or astonishment and cannot be considered a “neutral” interrogative
type [8]. Some controversies remain about the classification of interrogative
markers. It seems, however, reasonable to distinguish two types of question.
Yes-no questions use the following markers: “không” expresses a
question on the predicative relation itself, for instance “Trời mưa không?”;
“chưa” has an aspectual value, for example “Trời mưa chưa?”; “hay” gives an
explicit alternative choice, for example “Trời mưa hay trời nắng?”.
Open questions use indefinite words in the same way as wh-markers: ai
(who), bao giờ (when), bao lâu (how long), bao nhiêu (how many), bao xa
(how far), đâu (where), gì (what), mấy (how many), mấy giờ (at what time),
nào (which or what), như thế nào (how), sao/tại sao/vì sao (why), sao/làm sao
(how).
Some linguists have also mentioned a third type of question called
biased questions (suggesting an expected answer), which are associated with
the expression of an attitude. They are syntactically marked with the final
morphemes “à, ư” (surprise), “chứ” (logical evidence), “hả, hử, hở” (insisting
and astonishment), “nhé” (supposition, suggestion).
4.1.5.2 Injunctive markers
Injunction is expressed by the presence of “đi” at the end of a
declarative structure, for instance “Trời mưa đi!”.
A weaker injunction is expressed with “nhé” and a stronger (insisting) one
is expressed with the compound marker “hãy đi”.
4.1.5.3 Attitude and emotional markers
In Vietnamese, a final marker can be used to express speaker attitude.
Lê Thị Xuyến gave the following list: “ạ” (respect), “đấy” (admiration), “rồi”
(conclusive), “mà” (insistence), “sao” (surprise), “chăng” (doubt), “ha”
(anger), “nhỉ” (familiarity), “vậy” (external obligation) [8].
4.2 Some studies on Vietnamese prosody
In a tonal language like Vietnamese, with six lexical tones, which
moreover has a system of morphosyntactic markers to express emotions,
attitudes, mood and modality, it would not be surprising if intonation played a lesser role than in non-tonal languages such as French or English: what is
usually conveyed by intonation in many other languages is already marked.
This idea was developed by Gordina and Bystrov: “the more a language uses
morphosyntactic or syntactic means to express mood, modality and emotions,
the less it would rely on intonation for the same functions” [8].
This explains why there are very few studies on intonation in
Vietnamese. There are a few remarks in general grammar books. The
statements about intonation made by grammarians or linguists are rather
intuitive and not based on experimental description. For example, declarative sentences are said to be “falling”, with such descriptive terms as “fading” or
“decreasing” (Thompson), “falling” (Nguyễn Đăng Liêm), “normal” or “low
pitch” (Jones and Huỳnh Sanh Thông), whereas interrogative sentences are
said to be “rising” (Nguyễn Đăng Liêm), “sustaining” (Thompson), “higher
pitch level 1” (Jones and Huỳnh Sanh Thông). Expressive sentences, on the other hand, are said to have a rising contour with a higher pitch level: “higher
pitch level 2 or 3” (Jones and Huỳnh Sanh Thông), “increasing” (Thompson),
“rising-falling” (Nguyễn Đăng Liêm) [8].
There are a small number of experimental studies by Gordina and
Bystrov, Lê Thị Xuyến, and Nguyễn and Boulakia [9]. These studies have already
given some idea of the role and function of intonation in Vietnamese.
According to Gordina and Bystrov, the shorter the sentence, the greater
the difference between the intonation patterns [8]. In their examples
(a) Anh ấy đi sang nước Anh à?
(b) Anh ấy đi sang nước Anh.
(c) Không sách à?
(d) Không sách.
the difference is greater between the (c) and (d) patterns than between
(a) and (b), though in each case a declarative is contrasted with an
interrogative.
According to these same authors, an interrogative without a
morphosyntactic marker has a well differentiated pattern when compared to
an interrogative with a marker. In:
(e) Không sách à?
(f) Không sách?
(g) Không sách.
the difference between the intonation patterns of (g) and (f) is greater
than between (g) and (e).
Lê Thị Xuyến set out to establish whether six different attitudinal
sentence types (statement, irony, exasperation, anger, sadness and admiration) were differentiated on the prosodic level [8]. Her corpus consists of five
sentences, as follows:
“Mưa.”
“Trời mưa.”
“Cô ta xinh.”
“Khuya rồi.”
“Nam về lúc khoảng một rưỡi.”
Each sentence was read with different attitudes by 2 speakers (one
male, one female) and judged by 20 listeners. The results of her experiments showed
that only irony, anger and statement were identified above chance level (75%,
52.5% and 67.5% respectively). According to her, the neutral declarative is
characterized by a low register and a moderate tempo; irony has a higher
register, a larger tone movement and a slower tempo resulting in increased
sentence length; whereas anger is conveyed by a speeding up of tempo,
greater and more abrupt pitch movement, shortening of the utterance and an
increase in the overall intensity.
In order to bring out the pertinent prosodic features corresponding to
assertive, interrogative and imperative modes, while excluding attitudinal
variations, and to produce natural Vietnamese utterances, Nguyễn and
Boulakia [9] used a certain number of utterance pairs in which the final
question or imperative marker can be replaced by a homonymous lexical item.
The resulting pairs have the same syllabic and tonal structures but differing
morphosyntactic structures. They are therefore considered to be ambiguous,
and if they are discriminated, it has to be due to the presence of prosodic differences. Some example pairs of sentences in their corpus are as follows:
Statement - Question:
Lan thích ăn cơm không. (Lan only likes to eat rice.)
Lan thích ăn cơm không? (Does Lan like to eat rice?)
Statement - Imperative:
Tân bỏ đi chứ? (Tân, did you leave?)
Tân bỏ đi chứ! (Tân, do leave it!)
From five morphemes, “không”, “hả”, “sao”, “chứ”, “đi”, nine such
pairs of sentences were formulated. These 18 sentences were read by 4
speakers (2 males and 2 females) and judged by 22 listeners. The results of the
prosodic analysis showed that questions are shorter than statements and that this
difference is significant. Imperatives are even shorter, but the difference with
questions is not significant. In terms of intensity, the difference is significant
for the statement-imperative pair, but not for the statement-question and
question-imperative ones. As for intonation, the two members of the same
pair have the same overall F0 contour but there is a difference in terms of
register: the register of questions and imperatives is clearly higher than that of statements, while there is no difference between questions and imperatives
(Figure 4.2). There is an obvious difference in the last syllable: the “ngang”
tone falls in statements and is much higher and rising in questions, while the
mean value and movement are halfway between for imperatives. The rising
tones, “sắc” or “hỏi”, rise even more in questions than in
statements, while they tend to become flat or even fall slightly in the final part in
imperatives. This means that intonation influences the final-syllable
tone of the sentence.
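A register difference of this kind is often expressed in semitones rather than hertz, so that it can be compared across speakers. The following Python sketch shows one possible way to do so, under several assumptions: register is approximated by the arithmetic mean of the voiced F0 samples, unvoiced frames are coded as 0, and the two short contours are invented values for illustration only (they are not data from [9]).

```python
import math


def mean_f0(contour_hz):
    """Mean F0 over voiced frames (unvoiced frames are assumed to be 0)."""
    voiced = [f for f in contour_hz if f > 0]
    return sum(voiced) / len(voiced)


def semitone_difference(f0_a_hz, f0_b_hz):
    """Interval from f0_b to f0_a in semitones (positive if f0_a is higher)."""
    return 12.0 * math.log2(f0_a_hz / f0_b_hz)


# Invented contours in Hz, for illustration only.
statement = [0.0, 190.0, 200.0, 210.0, 0.0]
question = [0.0, 230.0, 240.0, 250.0, 0.0]

difference = semitone_difference(mean_f0(question), mean_f0(statement))
print(f"register difference: {difference:.2f} semitones")
```

A positive value indicates that the first contour is realized in a higher register than the second; the same function could be applied to the final syllable alone to compare final-tone movements.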