Fig 8 Harmonic parameters estimation: a) source signal; b) estimated deterministic part; c) estimated stochastic part
An example of harmonic analysis is presented in Figure 8(a). The source signal is a phrase uttered by a male speaker (Fs = 8 kHz). The deterministic part of the signal (Figure 8(b)) was synthesized from the estimated harmonic parameters and subtracted from the source in order to obtain the stochastic part (Figure 8(c)). The spectrograms show that all steady harmonics of the source are modelled by the sinusoidal representation, while the residual part contains the transient and noise components.
7.2 Harmonic analysis in TTS systems
This subsection presents an experimental application of sinusoidal modelling with the proposed analysis techniques to a TTS system. Although many different techniques have been proposed, segment concatenation is still the major approach to speech synthesis. The speech segments (allophones) are assembled into synthetic speech, and this process involves time-scale and pitch-scale modifications in order to produce natural-sounding speech. The concatenation can be carried out either in the time or in the frequency domain. Most time-domain techniques are similar to the Pitch-Synchronous Overlap and Add (PSOLA) method (Moulines and Charpentier, 1990). The speech waveform is separated into short-time signals by the analysis pitch-marks (defined by the source pitch contour) and then processed and joined at the synthesis pitch-marks (defined by the target pitch contour). The process requires accurate pitch estimation of the source waveform.
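To make the time-domain route concrete, the following Python sketch shows a bare-bones pitch-mark-driven overlap-add in the spirit of PSOLA; the two-period Hann window, the structure of the mark arrays and all names are illustrative assumptions, not the implementation of the cited method.

import numpy as np

def psola_like_ola(x, analysis_marks, synthesis_marks):
    # x: source waveform; *_marks: ascending sample indices of pitch-marks
    x = np.asarray(x, dtype=float)
    out = np.zeros(int(synthesis_marks[-1]) + int(analysis_marks[-1] - analysis_marks[-2]) + 1)
    n = min(len(analysis_marks), len(synthesis_marks)) - 1
    for i in range(1, n):
        # take roughly two local pitch periods centred on the analysis mark
        half = min(analysis_marks[i] - analysis_marks[i - 1],
                   analysis_marks[i + 1] - analysis_marks[i])
        seg = x[analysis_marks[i] - half : analysis_marks[i] + half]
        win = np.hanning(len(seg))
        start = int(synthesis_marks[i]) - half
        if start >= 0 and start + len(seg) <= len(out):
            out[start:start + len(seg)] += seg * win   # overlap-add at the target mark
    return out

Pitch is raised or lowered simply by placing the synthesis marks closer together or further apart than the analysis marks, which is why accurate pitch-mark placement matters so much.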
Fig 7 Frame analysis by autocorrelation and sinusoidal parameters conversion: a)
autocorrelation spectrum estimation; b) autocorrelation residual; c) instantaneous LPC
spectrum; d) instantaneous residual
7 Experimental applications
The described methods of sinusoidal and harmonic analysis can be used in several speech processing systems. This section presents some application results.
7.1 Application of harmonic analysis to parametric speech coding
Accurate estimation of sinusoidal parameters can significantly improve the performance of coding systems. Well-known compression algorithms that use a sinusoidal representation may benefit from accurate harmonic/residual separation, which provides higher quality of the decoded signal. The described analysis technique has been applied to hybrid speech and audio coding (Petrovsky et al., 2008).
Fig 9 Segment analysis: a) source waveform segment; b) estimated fundamental frequency contour; c) estimated harmonic amplitudes; d) estimated stochastic part; e) spectrogram of the source segment; f) spectrogram of the stochastic part
The periodic signal with pitch shifting can be synthesized directly from its parametric representation. In the synthesis process the phase differences Δφk(n) are good substitutes for the phase parameters φk(n), since all the harmonics are kept coordinated regardless of the frequency contour and the initial phase of the fundamental.
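The synthesis formula itself is not reproduced in this extract. A plausible form, written here only as a sketch with assumed symbols (Ak for the instantaneous harmonic amplitudes, ω0 for the instantaneous fundamental frequency in radians per sample, β for the pitch-scaling factor and Δφk for the relative harmonic phases), is

\[ \hat{s}(n) = \sum_{k=1}^{K} A_k(n)\,\cos\!\Big(\beta\sum_{m=0}^{n} k\,\omega_0(m) + \Delta\varphi_k(n)\Big). \]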
Due to the parametric representation, spectral amplitude and phase mismatches at segment borders can be efficiently smoothed. Spectral amplitudes of acoustically related sounds can be matched by simultaneous fading out and fading in, which is equivalent to linear spectral smoothing (Dutoit, 1997). Phase discontinuities can also be matched by linear laws, taking into account that the harmonic components are represented by their relative phases Δφk(n). However, large discontinuities (when the absolute difference exceeds π) should be eliminated by adding multiples of 2π to the phase parameters of the next segment. Thus, phase parameters are smoothed in the same way as spectral amplitudes, providing imperceptible concatenation of the segments.
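A minimal Python sketch of this border smoothing is given below; meeting halfway at the junction and all function and variable names are assumptions made for illustration, not the authors' exact smoothing laws.

import numpy as np

def smooth_junction(amp_a, phase_a, amp_b, phase_b):
    # amp_*, phase_*: per-harmonic amplitudes and relative phases of segment A's
    # last frame and segment B's first frame
    amp_a, amp_b = np.asarray(amp_a, float), np.asarray(amp_b, float)
    phase_a, phase_b = np.asarray(phase_a, float), np.asarray(phase_b, float)
    # eliminate large discontinuities: add multiples of 2*pi so that |difference| <= pi
    diff = phase_b - phase_a
    phase_b = phase_b - 2.0 * np.pi * np.round(diff / (2.0 * np.pi))
    # linear smoothing: fade out A while fading in B, i.e. meet halfway at the border
    return 0.5 * (amp_a + amp_b), 0.5 * (phase_a + phase_b)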
In Figure 10 the proposed approach is compared with PSOLA synthesis, implemented as described in (Moulines and Charpentier, 1990). A fragment of speech in Russian was synthesized by the two different techniques using the same source acoustic database.
Placing the analysis pitch-marks is an important stage that significantly affects synthesis quality.
Frequency-domain (parametric) techniques deal with frequency representations of the segments instead of their waveforms, which requires prior transformation of the acoustic database to the frequency domain. Harmonic modelling can be especially useful in TTS systems for the following reasons:
- explicit control over pitch, tempo and timbre of the speech segments, which ensures proper prosody matching;
- high-quality segment concatenation can be performed using simple linear smoothing laws;
- the acoustic database can be highly compressed;
- synthesis can be implemented with low computational complexity.
In order to perform real-time synthesis in the harmonic domain, all waveform speech segments should be analysed and stored in a new database that contains the estimated harmonic parameters and the waveforms of the stochastic signals. The analysis technique described in this chapter can be used for the parameterization. A result of such parameterization is presented in Figure 9; the analysed segment is the sound [a:] of a female voice.
Speech concatenation with prosody matching can be efficiently implemented using sinusoidal modelling. In order to modify the durations of the segments, the harmonic parameters are recalculated at new time instants defined by a dynamic warping function; the noise part is parameterized by spectral envelopes and then time-scaled as described in (Levine and Smith, 1998).
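As a sketch of the duration modification, assuming each harmonic parameter track (amplitude or frequency) is stored at uniformly spaced analysis instants and that the warping reduces to a constant stretch factor alpha (a simplification of the dynamic warping function mentioned above; names are hypothetical):

import numpy as np

def stretch_track(values, alpha, hop=1.0):
    # values: one harmonic parameter track sampled every `hop` time units
    # alpha: stretch factor (>1 lengthens the segment, <1 shortens it)
    old_t = np.arange(len(values)) * hop                          # original analysis instants
    query_t = np.arange(int(len(values) * alpha)) * hop / alpha   # new instants mapped back to the original time axis
    return np.interp(query_t, old_t, values)                      # parameter values at the new instants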
Changing the pitch of a segment requires recalculation of the harmonic amplitudes while maintaining the original spectral envelope. The noise part of the segment is not affected by pitch shifting and obviously should remain untouched. Let us consider the instantaneous spectral envelope as a function E(n, f) of two parameters (sample number and frequency, respectively). After harmonic parameterization the function is defined only at the frequencies of the harmonic components calculated at the respective time instants: E(n, fk(n)). In order to obtain a completely defined function, piecewise-linear interpolation is used. Such interpolation has low computational complexity and, at the same time, gives a sufficiently good approximation (Dutoit, 1997).
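A minimal sketch of the amplitude recalculation is shown below; numpy's interp performs exactly the piecewise-linear interpolation described above, while the function name, the scaling factor beta and the Nyquist cut-off are illustrative assumptions.

import numpy as np

def pitch_shift_amplitudes(f0, amps, beta, fs):
    # f0: current fundamental frequency (Hz); amps: harmonic amplitudes at this instant
    old_freqs = np.arange(1, len(amps) + 1) * f0                   # frequencies of the analysed harmonics
    new_f0 = beta * f0                                             # shifted fundamental
    new_freqs = np.arange(1, int(0.5 * fs / new_f0) + 1) * new_f0  # new harmonics, kept below Nyquist
    # E(n, f) at this instant: piecewise-linear envelope through (old_freqs, amps)
    new_amps = np.interp(new_freqs, old_freqs, np.asarray(amps, float))
    return new_freqs, new_amps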
8 Conclusions
An estimation technique for instantaneous sinusoidal parameters has been presented in this chapter. The technique is based on narrow-band filtering and can be applied to audio and speech signals. Signals with a harmonic structure (such as voiced speech) can be analysed using frequency-modulated filters with an adjustable impulse response. The technique performs well in the sense that accurate estimation is possible even in the case of rapid frequency modulations of the pitch. A method of pitch detection and estimation has been described as well. The use of filters with a modulated impulse response, however, requires precise estimation of the instantaneous pitch, which can be achieved by recalculating the pitch values during the analysis process. The main disadvantage of the method is its high computational cost in comparison with the STFT.
Some experimental applications of the proposed approach have been illustrated. Sinusoidal modelling based on the presented technique has been applied to speech coding and to TTS synthesis with wholly satisfactory results.
The sinusoidal model can also be used for the estimation of LPC parameters that describe the instantaneous behaviour of a periodic signal. The presented technique for converting sinusoidal parameters into prediction coefficients provides high energy localization and a smaller residual for frequency-modulated signals; however, the overall performance entirely depends on the quality of the prior sinusoidal analysis.
The database segments were picked out from the speech of a female speaker. The sound sample in Figure 10(a) is the result of the PSOLA method.
Fig 10 TTS synthesis comparison: a) PSOLA synthesis; b) harmonic domain concatenation
The sound sample shown in Figure 10(b) is the result of the described analysis/synthesis approach. In order to obtain the parametric representation of the acoustic database, each segment was classified as either voiced or unvoiced. The unvoiced segments were left untouched, while the voiced segments were analysed by the technique described in Section 4; then prosody modifications and segment concatenation were carried out. Both sound samples were synthesized at 22 kHz, using the same predefined pitch contour.
As can be noticed from the presented samples, the time-domain concatenation approach produces audible artefacts at segment borders. They are caused by phase and pitch mismatches that cannot be effectively avoided during synthesis. The described parametric approach provides almost inaudible phase and pitch smoothing without distorting the spectral and formant structure of the segments. The experiments have shown that this technique works well even for short and fricative segments; however, the short Russian 'r' required special adjustment of the filter parameters at the analysis stage in order to analyse the segment properly.
The main drawback of the described approach is noise amplification immediately at segment borders, where the analysis filter gives less accurate results because of spectral leakage. In the current experiment the problem was solved by fading out the estimated noise part at segment borders. It is also possible to pick out longer segments at the database preparation stage and then shorten them after parameterization.
7.3 Instantaneous LPC analysis of speech
LPC-based techniques are widely used for formant tracking in speech applications. By performing harmonic analysis first and then converting the parameters, higher accuracy of formant frequency estimation can be achieved. A result of voiced speech analysis is presented in Figure 11. The analysed signal (Figure 11(a)) is a vowel [a:] uttered by a male speaker. The sound was sampled at 8 kHz and analysed by the autocorrelation (Figure 11(b)) and the harmonic conversion (Figure 11(c)) techniques. In order to give expressive pictures, the prediction coefficients were updated for every sample of the signal in both cases. The autocorrelation analysis was carried out with an analysis frame of 512 samples weighted by a Hamming window; the prediction order was 20 in both cases.
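The chapter's own conversion procedure is described earlier in the text; as a generic illustration of how instantaneous harmonic parameters can be turned into prediction coefficients, the sketch below builds the autocorrelation sequence of the line spectrum and solves the normal equations. This is a common route, not necessarily the authors' exact method, and the names are assumptions.

import numpy as np
from scipy.linalg import solve_toeplitz

def harmonics_to_lpc(amps, freqs, fs, order=20):
    # amps, freqs: instantaneous harmonic amplitudes and frequencies (Hz) at one instant
    w = 2.0 * np.pi * np.asarray(freqs, float) / fs           # frequencies in rad/sample
    a2 = 0.5 * np.asarray(amps, float) ** 2                   # sinusoid powers
    lags = np.arange(order + 1)
    r = np.sum(a2[:, None] * np.cos(np.outer(w, lags)), axis=0)   # autocorrelation r[0..order]
    r[0] *= 1.000001                                          # tiny bias for numerical stability
    a = solve_toeplitz(r[:order], r[1:order + 1])             # normal equations R a = r
    return np.concatenate(([1.0], -a))                        # coefficients of the LPC polynomial A(z)

The peaks of the resulting all-pole spectrum follow the formants implied by the harmonic amplitudes, which is the property exploited in the comparison of Figure 11.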
The instantaneous prediction coefficients allow fine formant tracking, which can be useful in applications such as speaker identification and speech recognition.
Future work is aimed at further investigation of the analysis filters and their behaviour, and at finding optimized solutions for the evaluation of sinusoidal parameters. There may also be some potential in adapting the described methods to other applications, such as vibration analysis of mechanical devices and the diagnostics of throat diseases.
9 Acknowledgments
This work was supported by the Belarusian Republican Fund for Fundamental Research under grant T08MC-040 and by the Belarusian Ministry of Education under grant 09-3102.
10 References
Abe, T.; Kobayashi, T. & Imai, S. (1995). Harmonics tracking and pitch extraction based on instantaneous frequency, Proceedings of ICASSP 1995, pp. 756-759, 1995
Azarov, E.; Petrovsky, A. & Parfieniuk, M. (2008). Estimation of the instantaneous harmonic parameters of speech, Proceedings of the 16th European Signal Processing Conference (EUSIPCO-2008), CD-ROM, Lausanne, 2008
Boashash, B. (1992). Estimating and interpreting the instantaneous frequency of a signal, Proceedings of the IEEE, Vol. 80, No. 4, (1992) 520-568
Dutoit, T. (1997). An Introduction to Text-to-Speech Synthesis, Kluwer Academic Publishers, The Netherlands
Gabor, D. (1946). Theory of communication, Proc. IEE, Vol. 93, No. 3, (1946) 429-457
Gianfelici, F.; Biagetti, G.; Crippa, P. & Turchetti, C. (2007). Multicomponent AM-FM representations: an asymptotically exact approach, IEEE Transactions on Audio, Speech, and Language Processing, Vol. 15, No. 3, (March 2007) 823-837
Griffin, D. & Lim, J. (1988). Multiband excitation vocoder, IEEE Trans. on Acoustics, Speech and Signal Processing, Vol. 36, No. 8, (1988) 1223-1235
Hahn, S. L. (1996). Hilbert Transforms in Signal Processing, Artech House, Boston, MA
Huang, X.; Acero, A. & Hon, H. W. (2001). Spoken Language Processing, Prentice Hall, New Jersey
Levine, S. & Smith, J. (1998). A sines+transients+noise audio representation for data compression and time/pitch scale modifications, AES 105th Convention, Preprint 4781, San Francisco, CA, USA
Maragos, P.; Kaiser, J. F. & Quatieri, T. F. (1993). Energy separation in signal modulations with application to speech analysis, IEEE Trans. on Signal Processing, Vol. 41, No. 10, (1993) 3024-3051
Markel, J. D. & Gray, A. H. (1976). Linear Prediction of Speech, Springer-Verlag, Berlin Heidelberg New York
McAulay, R. J. & Quatieri, T. F. (1986). Speech analysis/synthesis based on a sinusoidal representation, IEEE Trans. on Acoustics, Speech and Signal Processing, Vol. 34, No. 4, (1986) 744-754
McAulay, R. J. & Quatieri, T. F. (1992). The sinusoidal transform coder at 2400 b/s, Proceedings of the Military Communications Conference, San Diego, Calif., USA, October 1992
Moulines, E. & Charpentier, F. (1990). Pitch synchronous waveform processing techniques for text-to-speech synthesis using diphones, Speech Communication, Vol. 9, No. 5-6, (1990) 453-467
Painter, T. & Spanias, A. (2003). Sinusoidal analysis-synthesis of audio using perceptual criteria, EURASIP Journal on Applied Signal Processing, No. 1, (2003) 15-20
Petrovsky, A.; Stankevich, A. & Balunowski, J. (1999). The order tracking front-end algorithms in the rotating machine monitoring systems based on the new digital low order tracking, Proc. of the 6th Intern. Congress "On Sound and Vibration", pp. 2985-2992, Copenhagen, Denmark, 1999
Petrovsky, A.; Azarov, E. & Petrovsky, A. (2008). Harmonic representation and auditory model-based parametric matching and its application in speech/audio analysis, AES 126th Convention, Preprint 7705, Munich, Germany
Rabiner, L. & Juang, B. H. (1993). Fundamentals of Speech Recognition, Prentice Hall, New Jersey
Serra, X. (1989). A system for sound analysis/transformation/synthesis based on a deterministic plus stochastic decomposition, Ph.D. thesis, Stanford University, Stanford, Calif., USA
Spanias, A. S. (1994). Speech coding: a tutorial review, Proc. of the IEEE, Vol. 82, No. 10, (1994) 1541-1582
Weruaga, L. & Kepesi, M. (2007). The fan-chirp transform for non-stationary harmonic signals, Signal Processing, Vol. 87, Issue 6, (June 2007) 1-18
Zhang, F.; Bi, G. & Chen, Y. Q. (2004). Harmonic transform, IEE Proc.-Vis. Image Signal Process., Vol. 151, No. 4, (August 2004) 257-264
Music Structure Analysis Statistics for Popular Songs
Namunu C. Maddage, Li Haizhou and Mohan S. Kankanhalli
School of Electrical and Computer Engineering, Royal Melbourne Institute of Technology
(RMIT) University, Swanston Street, Melbourne, 3000, Australia
1Dept of Human Language Technology, Institute for Infocomm Research,
1 Fusionopolis Way, Singapore 138632
2School of Computing, National University of Singapore, Singapore, 117417
Abstract
In this chapter we propose an improved procedure for the manual annotation of music information. The proposed annotation procedure involves carrying out listening tests and then incorporating music knowledge to iteratively refine the detected music information. Using this annotation technique, we can effectively compute the durations of the music notes, time-stamp the music regions, i.e. pure instrumental, pure vocal, instrumental mixed vocal and silence, and annotate the semantic music clusters (components in a song structure), i.e. Verse (V), Chorus (C), Bridge (B), Intro, Outro and Middle-eighth.
From the annotated information we have further derived statistics of the music structure information. We conducted experiments on 420 popular songs sung in English, Chinese, Indonesian and German. We assumed a constant tempo throughout each song and a 4/4 meter. Statistical analysis revealed that 62.46%, 35.48%, 1.87% and 0.17% of the content of a song belongs to the instrumental mixed vocal, pure instrumental, silence and pure vocal music regions, respectively. We also found that over 70% of the English and Indonesian songs and 30% of the Chinese songs used the V-C-V-C and V-V-C-V-C song structures respectively, where V and C denote the verse and chorus. It was also found that 51% of the English songs, 37% of the Chinese songs and 35% of the Indonesian songs used an 8-bar duration in both the chorus and the verse.
1 Introduction
Music is a universal language that people use for sharing their feelings and sensations. Thus there has been keen research interest not only in understanding how music information stimulates our minds, but also in developing applications based on music information. For example, vocal and non-vocal music information is useful for sung language recognition systems (Tsai et al., 2004; Schwenninger et al., 2006), lyrics-text and music alignment systems (Wang et al., 2004), mood classification systems (Lu & Zhang, 2006), music genre classification (Nwe & Li, 2007; Tzanetakis & Cook, 2002) and music classification systems (Xu et al., 2005; Burred & Lerch, 2004).
Fig 1 Information grouping in the music structure pyramid
1) The first layer represents the time information (beats, tempo, and meter);
2) The second layer represents the harmony/melody, which is formed by playing musical notes simultaneously;
3) The third layer describes the music regions, i.e. pure vocal (PV), pure instrumental (PI), instrumental mixed vocal (IMV) and silence (S);
4) The fourth layer and above represent the semantics of the popular song.
The pyramid diagram represents music semantics, which influence our imaginations. Jourdain (1997) also discussed how sound, tone, melody, harmony, composition, performance, listening, understanding and ecstasy lead to our imagination. Time information describes the rate of information flow in music. The durations of harmony/melody contours and of the phrases which create music regions are proportional to the tempo of the music. Melody is created when a single note is played at a time; playing multiple notes simultaneously results in a harmony sound. Psychological studies have suggested that the human cognitive mechanism can effectively distinguish the tones of the diatonic scale (Burred & Lerch, 2004). Scale changes, or modulation of the scale in a different section of the song, can effectively be noticed in listening tests.
Information about rhythm, harmony and melody contours and about song structures (such as repetitions of the chorus and verse semantic regions) is also useful for developing systems for error concealment in music streaming (Wang et al., 2003), music protection (watermarking), music summarization (Xu et al., 2005), compression, and music search.
The computer music research community has been developing algorithms to accurately extract the information in music. Many of the proposed algorithms require ground truth data for both the parameter training process and performance evaluation. For example, the performance of a music classifier which classifies the content of a music segment as vocal or non-vocal can be improved when the parameters of the classifier are trained with accurately labelled vocal and non-vocal music content in the development dataset. Also, the performance of the classifier can only be measured effectively when the evaluation dataset is accurately annotated based on the exact music composition information. However, it is difficult to create accurate development and evaluation datasets, because it is difficult to find information about the music composition, mainly due to copyright restrictions on sharing music information in the public domain. Therefore, the current development and evaluation datasets are created by annotating information that is extracted using subjective listening tests. Tanghe et al. (2005) discussed an annotation method for drum sounds. In Goto (2006)'s method, music scenes such as the beat structure, chorus, and melody line are annotated with the help of corresponding MIDI files. Li et al. (2006) modified general audio editing software so that it becomes more convenient for identifying music semantic regions such as the chorus. The accuracy of a subjective listening test hinges on the subject's hearing competence, concentration and music knowledge. For example, it is often difficult to judge the start and end times of vocal phrases when they are presented with strong background music. If the listener's concentration is disturbed, the listening continuity is lost and it becomes difficult to accurately mark the phrase boundaries. However, if we know the tempo and meter of the music, then we can apply that knowledge to correct the errors in the phrase boundaries detected in the listening tests.
The speed of music information flow is directly proportional to the tempo of the music (Authors, 1949). Therefore the durations of music regions, semantic regions, inter-beat intervals, and beat positions can be measured as multiples of music notes. The music information annotation technique proposed in this chapter first locates the beat and onset positions by both listening to and visualizing the music signal using a graphical waveform editor. Since the time duration of a detected beat or onset from the start of the music is an integer multiple of the duration of the smallest note, we can estimate the duration of the smallest note. Then we carry out an intensive listening exercise, with the help of the estimated duration of the smallest music note, to detect the time stamps of the music regions and of the different semantic regions. Using the annotated information, we detect the song structure and calculate the statistics of the music information distributions.
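The statement that every detected beat/onset time is an integer multiple of the smallest note suggests a simple estimator; the Python sketch below scores candidate note durations by how well an integer grid explains the onset times. The search range and the scoring rule are assumptions for illustration, not the procedure of the listening test itself.

import numpy as np

def estimate_smallest_note(onset_times_ms, lo=80.0, hi=400.0, step=0.25):
    # onset_times_ms: beat/onset times measured from the start of the song (ms)
    onsets = np.asarray(onset_times_ms, float)
    candidates = np.arange(lo, hi, step)           # candidate smallest-note durations (ms)
    # mean distance of each onset from the nearest integer multiple, in note units
    errors = [np.mean(np.abs(onsets / d - np.round(onsets / d))) for d in candidates]
    return candidates[int(np.argmin(errors))]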
This chapter is organized as follows. Popular music structure is discussed in section 2 and effective information annotation procedures are explained in section 3. Section 4 details the statistics of music information. We conclude the chapter in section 5 with a discussion.
2 Music Structure
As shown in Fig 1, the underlying music information can conceptually be represented as layers in a pyramid (Maddage, 2005); these information layers are the four layers listed with Fig 1 above.
However, in some songs both the verse and the chorus are equally melodically strong, and most people can hum or sing both. A Bridge links the gap between the Verse and the Chorus and may be only two or four bars long.
Fig 4 Two examples of verse-chorus pattern repetitions
Therefore, in our listening tests we detect the Middle-eighth regions (see the next section), which have a different key from the main key of the song.
The rhythm of words can be tailored to fit into a music phrase (Authors, 1949). The vocal regions in music comprise words and syllables, which are uttered according to a time signature (TS). Fig 2 shows how the words "Little Jack Horner sat in the Corner" are turned into a rhythm, together with the music notation of those words. The important words or syllables in the sentence fall onto accents to form the rhythm of the music. Typically, these words are placed at the first beat of a bar. When the TS is set to two Crotchet beats per bar, we see that the duration of the word "Little" is equal to two Quaver notes and the duration of the word "Jack" is equal to a Crotchet note.
Fig 2 Rhythmic flow of words
Popular song structure often contains an Intro, Verse, Chorus, Bridge, Middle-eighth, instrumental sections (INST) and Outro (Authors, 2003). As shown in Fig 1, these parts are built on melody-based similarity regions and content-based similarity regions. Melody-based similarity regions are defined as regions which have similar pitch contours constructed from the chord patterns. Content-based similarity regions are defined as regions which have both similar vocal content and similar melody. In terms of music structure, the Chorus sections and Verse sections of a song are considered the content-based similarity regions and the melody-based similarity regions respectively. They can be grouped to form semantic clusters as in Fig 3. For example, all the Chorus regions in a song form a Chorus cluster, while all the Verse regions form a Verse cluster, and so on.
Fig 3 Semantic similarity clusters which define the structure of the popular song
A song may have an Intro of 2, 4, 8 or 16 bars, or none at all. The Intro usually consists of instrumental music. Both the Verse and the Chorus are 8 or 16 bars long. Typically, the Verse is not melodically as strong as the Chorus.
Fig 5 Spectral and time-domain visualization of a (0~3657) ms song clip from "25 Minutes" by MLTR. The quarter-note length is 736.28 ms and note boundaries are highlighted using dotted lines
3.1 Computation of Inter-beat interval
Once the staff of a song is available, the duration of a beat can be calculated from the tempo and the time signature. However, commercially available music albums (CDs) do not provide the staff information of the songs. Therefore subjects with a good knowledge of the theory and practice of music have to closely examine the songs to estimate the inter-beat intervals. We assume all the songs have a 4/4 time signature, which is the most commonly used TS in popular songs (Goto, 2001; Authors, 2003). Following the results of music composition and structure discussed in section 2, we only allow beat positions to occur at integer multiples of smaller notes from the start point of the song. The estimation of both the inter-beat interval and the song tempo using iterative listening is explained below, with Fig 6 as an example.
Play the song in audio editing software that has a GUI to visualize the time-domain signal with high resolution. While listening to the music, a steady throb to which one can clap can be noticed. The duration between consecutive claps is called the inter-beat interval. Since we assume a 4/4 time signature, the inter-beat interval is of quarter-note length, and four quarter notes form a bar.
As shown in Fig 6, the positions of both beats and note onsets can be effectively visualized on the GUI; the j-th position is indicated as P_j. By replaying the song and zooming into the areas of neighbouring beats and onset positions, we can estimate the inter-beat interval and therefore the duration of the note X_j.
Silence may also act as a Bridge between the Verse and the Chorus of a song, but such cases are rare. The Middle-eighth, which is 4, 8 or 16 bars in length, is an alternative version of a Verse with a new chord progression, possibly modulated to a different key. Many people use the terms "Middle-eighth" and "Bridge" synonymously. However, the main difference is that the Middle-eighth is longer (usually 16 bars) than the Bridge and usually appears after the third verse of the song. There are also instrumental sections in the song; they can be instrumental versions of the Chorus or Verse, or entirely different tunes built on a set of chords. Typically INST regions have 8 or 16 bars. The Outro, which is the ending of the song, is usually a fade-out of the last phrases of the chorus. We have described the parts of a song, which are commonly arranged according to a simple verse-chorus-and-repeat pattern. Two variations on this theme are as follows:
(a) Intro, Verse 1, Verse 2, Chorus, Verse 3, Middle-eighth, Chorus, Chorus, Outro
(b) Intro, Verse 1, Chorus, Verse 2, Chorus, Chorus, Outro
Fig 4 illustrates two examples of the above two patterns. The song "25 Minutes" by MLTR follows pattern (a) and "Can't Let You Go" by Mariah Carey follows pattern (b). For a better understanding of how artists combine these parts to compose a song, we conducted a survey on popular Chinese and English songs. Details of the survey are discussed in the next section.
3 Music Structure Information Annotation
The fundamental step for audio content analysis is signal segmentation. Within a segment, the information can be considered quasi-stationary. Feature extraction and information modelling followed by music segmentation are the essential steps for music structure analysis. Determining a segment size which is suitable for extracting a certain level of information requires a good understanding of the rate of information flow in the audio data. Over three decades of speech processing research have revealed that 20-40 ms fixed-length signal segmentation is appropriate for speech content analysis (Rabiner & Juang, 2005). The composition of a music piece reveals that the rate of information flow, such as notes, chords, key and vocal phrases, is proportional to the inter-beat intervals.
Fig 5 shows the quarter-, eighth- and sixteenth-note boundaries in a song clip. It can be seen that the fluctuations of the signal properties in both the spectral and the time domain are aligned with those note boundaries. Usually smaller notes, such as eighth, sixteenth and thirty-second notes or smaller, are played in the bars to align the harmony contours with the rhythmic flow of the lyrics and to fill the gaps between lyrics (Authors, 1949). Therefore inter-beat-proportional music segmentation, instead of fixed-length segmentation, has recently been proposed for music content analysis (Maddage, 2004; Maddage, 2005; Wang, 2004).
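A sketch of the difference between fixed-length and inter-beat-proportional segmentation is given below; the per-beat subdivision and the function names are assumed for illustration only.

import numpy as np

def beat_proportional_frames(num_samples, fs, inter_beat_s, subdivision=4):
    # frame length = inter-beat interval / subdivision (e.g. sixteenth notes in 4/4)
    frame_len = int(round(fs * inter_beat_s / subdivision))
    starts = np.arange(0, num_samples - frame_len + 1, frame_len)
    return [(int(s), int(s + frame_len)) for s in starts]

def fixed_frames(num_samples, fs, frame_ms=30.0):
    # conventional 20-40 ms segmentation used in speech analysis
    frame_len = int(round(fs * frame_ms / 1000.0))
    starts = np.arange(0, num_samples - frame_len + 1, frame_len)
    return [(int(s), int(s + frame_len)) for s in starts]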
Step 3: At the beat/onset position P_j, we calculate the new note length X_{j+1} as
X_{j+1} = P_j / NF_j                (4)
Step 4: Iterate steps 1 to 3 at beat or onset positions towards the end of the song. When these iterative steps are carried out over many of the beat and onset positions towards the end of the song, the errors in the estimated note length are minimized. Based on the final length estimate for the note, we can calculate the quarter-note length.
Fig 7 shows the variation of the estimated quarter-note length for two songs. The beat/onset positions divide the song into nearly equal intervals. Beat/onset point zero ("0") represents the first estimation of the quarter-note length. The correct tempos of the songs "You are still the one" and "The woman in me" are 67 BPM and 60 BPM respectively.
It can be seen in the Fig 7, that the deviation of the estimated quarter note is high at the beginning of the song However estimated quarter note converges to the correct value at the second half of the song (end of the song) Reason for the fluctuation of the estimated note length is explained below
As shown in Fig 6, first estimation of the note length (X1) is done using only audio-visual editing software Thus first estimation (beat/ onset point P = 0 in Fig 7) can have very high variation due to the prime difficulties of judging the correct boundaries of the notes When the song proceeds, using Eq, (1), (2), (3) and (4), we iteratively estimate the duration of the note for the corresponding beat/onset points Since beat/onset points near the start of the
song have shorter duration (P j), initial iterative estimations for the note length have higher
variation For example in Fig 6, beat/onset point P 1 is closer to start of the song and from
the Eq (1) and first estimation of note length X 1 , we compute NF 1 Eq (2) and (3) are useful
in limiting the errors in computed NF under one frame When X 1 is inaccurate and also P 1 is
short then the errors in computed number of frames NF 1 in Eq (1) have higher effects in the
next estimated note length in Eq (4) However with distant beat/onset points, i.e Pj is longer and NF j is high and more accurate, then the estimated note lengths tend to converge
In Fig 6, we can see that the duration X_1 is the first estimated eighth note.
Fig 6 Estimation of music note
After the first estimation, we establish a 4-step iterative listening process, discussed below, to reduce the error between the estimates and the true music note lengths. The constraint we apply is that a beat position is equal to an integer multiple of frames. To start with, the first frame size is set to the estimated note, i.e. frame size = X_1 in Fig 6.
Step 1: Set the currently estimated note length as the frame size and calculate the number of frames NF_j at an identified beat or onset position; for the initialization we set j = 1:

NF_j = P_j / X_j                (1)

Step 2: As the resulting NF_j is typically a floating-point value, we measure the difference between the rounded-up NF_j and NF_j, referred to as DNF.
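A Python sketch of the four-step refinement as it is described above is given below; since Eqs. (2) and (3) are not reproduced in this extract, the sketch simply rounds the frame count to the nearest integer before re-estimating the note length, which is an assumption.

def refine_note_length(x1, beat_positions_ms):
    # x1: first manual estimate of the note length (ms)
    # beat_positions_ms: beat/onset positions P_j from the start of the song (ms), ascending
    x = float(x1)
    for p in beat_positions_ms:
        nf = p / x                        # Eq. (1): number of frames at position P_j
        nf_int = max(1, int(round(nf)))   # assumed rounding in place of Eqs. (2)-(3)
        x = p / nf_int                    # Eq. (4): new note length X_{j+1}
    return x                              # the quarter-note length follows from this estimate

Because a fixed rounding error is divided by an ever larger P_j, later positions pull the estimate towards the true value, which matches the convergence behaviour shown in Fig 7.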