Recent Advances in Signal Processing 2011, Part 11

35 305 0
Tài liệu đã được kiểm tra trùng lặp

Đang tải... (xem toàn văn)

Tài liệu hạn chế xem trước, để xem đầy đủ mời bạn chọn Tải xuống

THÔNG TIN TÀI LIỆU

Thông tin cơ bản

Tiêu đề Recent Advances in Signal Processing 2011 Part 11 ppt
Trường học University of Signal Processing Research
Chuyên ngành Signal Processing
Thể loại Lecture notes
Năm xuất bản 2011
Thành phố Unknown
Định dạng
Số trang 35
Dung lượng 6,18 MB

Các công cụ chuyển đổi và chỉnh sửa cho tài liệu này

Nội dung


Fig 7 Frame analysis by autocorrelation and sinusoidal parameters conversion: a) autocorrelation spectrum estimation; b) autocorrelation residual; c) instantaneous LPC spectrum; d) instantaneous residual

7 Experimental applications

The described methods of sinusoidal and harmonic analysis can be used in several speech processing systems. This section presents some application results.

7.1 Application of harmonic analysis to parametric speech coding

Accurate estimation of sinusoidal parameters can significantly improve the performance of coding systems. Well-known compression algorithms that use the sinusoidal representation may benefit from accurate harmonic/residual separation, which provides higher quality of the decoded signal. The described analysis technique has been applied to hybrid speech and audio coding (Petrovsky et al., 2008).

Fig 8 Harmonic parameters estimation: a) source signal; b) estimated deterministic part; c) estimated stochastic part

An example of harmonic analysis is presented in Figure 8(a). The source signal is a phrase uttered by a male speaker (Fs = 8 kHz). The deterministic part of the signal, Figure 8(b), was synthesized using the estimated harmonic parameters and subtracted from the source in order to get the stochastic part, Figure 8(c). The spectrograms show that all steady harmonics of the source are modelled by the sinusoidal representation, while the residual part contains transient and noise components.
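The separation illustrated in Figure 8 can be sketched in a few lines of code. The snippet below is a minimal illustration rather than the authors' implementation: it assumes that harmonic analysis has already produced per-sample amplitude and frequency tracks (hypothetical arrays `amps` and `freqs`), synthesizes the deterministic part by summing the harmonics, and obtains the stochastic part by subtraction. Initial phase offsets are passed in as well, since without them the subtraction would not cancel the harmonics.

```python
import numpy as np

def synthesize_deterministic(amps, freqs, phases0, fs):
    """Sum of harmonics built from per-sample amplitude/frequency tracks.

    amps, freqs: arrays of shape (num_harmonics, num_samples) holding the
    instantaneous amplitude and frequency (Hz) of each harmonic.
    phases0: initial phase (rad) of each harmonic.
    """
    num_harm, num_samples = freqs.shape
    det = np.zeros(num_samples)
    for k in range(num_harm):
        # The instantaneous phase is the running integral of the frequency track.
        phase = phases0[k] + 2.0 * np.pi * np.cumsum(freqs[k]) / fs
        det += amps[k] * np.cos(phase)
    return det

def stochastic_part(source, amps, freqs, phases0, fs):
    """Residual obtained by subtracting the synthesized deterministic part."""
    return source - synthesize_deterministic(amps, freqs, phases0, fs)
```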

7.2 Harmonic analysis in TTS systems

This subsection presents an experimental application of sinusoidal modelling with the proposed analysis techniques to a TTS system. Although many different techniques have been proposed, segment concatenation is still the major approach to speech synthesis. The speech segments (allophones) are assembled into synthetic speech, and this process involves time-scale and pitch-scale modifications in order to produce natural-sounding speech. The concatenation can be carried out either in the time domain or in the frequency domain. Most time domain techniques are similar to the Pitch-Synchronous Overlap and Add (PSOLA) method (Moulines and Charpentier, 1990): the speech waveform is separated into short-time signals by the analysis pitch-marks (defined by the source pitch contour) and then processed and joined by the synthesis pitch-marks (defined by the target pitch contour). The process requires accurate pitch estimation of the source waveform, and placing the analysis pitch-marks is an important stage that significantly affects synthesis quality.

Frequency domain (parametric) techniques deal with frequency representations of the segments instead of their waveforms, which requires prior transformation of the acoustic database to the frequency domain. Harmonic modelling can be especially useful in TTS systems for the following reasons:

- explicit control over pitch, tempo and timbre of the speech segments, which ensures proper prosody matching;
- high-quality segment concatenation can be performed using simple linear smoothing laws;
- the acoustic database can be highly compressed;
- synthesis can be implemented with low computational complexity.

In order to perform real-time synthesis in the harmonic domain, all waveform speech segments should be analysed and stored in a new database, which contains the estimated harmonic parameters and the waveforms of the stochastic signals. The analysis technique described in this chapter can be used for the parameterization. A result of such parameterization is presented in Figure 9; the analysed segment is the sound [a:] of a female voice.
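As a rough illustration of what one entry of such a parametric database could hold, the sketch below uses a hypothetical record layout; the field names are assumptions made for the example, not taken from the chapter.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class HarmonicSegment:
    """One allophone of the acoustic database (illustrative layout)."""
    label: str          # segment label, e.g. "a:"
    fs: int             # sampling rate, Hz
    amps: np.ndarray    # (num_harmonics, num_frames) instantaneous amplitudes
    freqs: np.ndarray   # (num_harmonics, num_frames) instantaneous frequencies, Hz
    phases: np.ndarray  # (num_harmonics, num_frames) instantaneous phases, rad
    noise: np.ndarray   # stochastic part stored as a waveform

    @property
    def num_harmonics(self) -> int:
        return self.amps.shape[0]
```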

Fig 9 Segment analysis: a) source waveform segment; b) estimated fundamental frequency contour; c) estimated harmonic amplitudes; d) estimated stochastic part; e) spectrogram of the source segment; f) spectrogram of the stochastic part

Speech concatenation with prosody matching can be efficiently implemented using sinusoidal modelling. In order to modify the durations of the segments, the harmonic parameters are recalculated at new instants defined by a dynamic warping function; the noise part is parameterized by spectral envelopes and then time-scaled as described in (Levine and Smith, 1998).

Changing the pitch of a segment requires recalculation of the harmonic amplitudes while maintaining the original spectral envelope. The noise part of the segment is not affected by pitch shifting and should remain untouched. Let us consider the instantaneous spectral envelope as a function E(n, f) of two parameters (sample number and frequency respectively). After harmonic parameterization the function is defined only at the frequencies of the harmonic components calculated at the respective instants of time, E(n, f_k(n)). In order to obtain a completely defined function, piecewise-linear interpolation is used. Such interpolation has low computational complexity and, at the same time, gives a sufficiently good approximation (Dutoit, 1997).
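A minimal sketch of this amplitude-recalculation step is given below. It is an assumed interface rather than the chapter's code: at one instant the envelope is known only at the original harmonic frequencies, so the amplitudes of the pitch-shifted harmonics are read off that envelope by piecewise-linear interpolation.

```python
import numpy as np

def shift_harmonic_amplitudes(amps_n, freqs_n, pitch_ratio, fs):
    """Recompute harmonic amplitudes after pitch shifting at one instant n.

    amps_n, freqs_n: amplitudes and frequencies (Hz) of the harmonics at
    instant n, i.e. the sampled envelope E(n, f_k(n)).
    pitch_ratio: e.g. 1.12 raises the pitch by 12 percent.
    """
    new_freqs = freqs_n * pitch_ratio
    # Keep only harmonics that remain below the Nyquist frequency.
    new_freqs = new_freqs[new_freqs < fs / 2.0]
    # Piecewise-linear interpolation of the envelope at the new frequencies;
    # the envelope is assumed to decay to zero at 0 Hz and at Nyquist.
    grid = np.concatenate(([0.0], freqs_n, [fs / 2.0]))
    env = np.concatenate(([0.0], amps_n, [0.0]))
    new_amps = np.interp(new_freqs, grid, env)
    return new_freqs, new_amps
```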

The periodic signal with pitch shifting can then be synthesized directly from this parametric representation. In the synthesis process the phase differences are good substitutes for the absolute phase parameters, since all the harmonics are kept coordinated regardless of the frequency contour and the initial phase of the fundamental.

Due to the parametric representation, spectral amplitude and phase mismatches at segment borders can be efficiently smoothed. Spectral amplitudes of acoustically related sounds can be matched by simultaneous fading out and in, which is equivalent to linear spectral smoothing (Dutoit, 1997). Phase discontinuities can also be matched by linear laws, taking into account that the harmonic components are represented by their relative phases. However, large discontinuities (when the absolute difference exceeds π) should be eliminated by adding multiples of 2π to the phase parameters of the next segment. Thus, the phase parameters are smoothed in the same way as the spectral amplitudes, providing imperceptible concatenation of the segments.
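The phase-matching rule just described can be sketched as follows. This is an illustration of the stated rule rather than the authors' code: the offset added to the incoming segment is a multiple of 2π chosen so that the discontinuity at the border has magnitude at most π, after which simple linear smoothing can be applied.

```python
import numpy as np

def phase_offset(prev_phase_end, next_phase_start):
    """Multiple-of-2*pi offset that keeps the border phase jump within pi."""
    diff = next_phase_start - prev_phase_end
    wrapped = (diff + np.pi) % (2.0 * np.pi) - np.pi  # jump folded into [-pi, pi)
    return wrapped - diff  # add this (a multiple of 2*pi) to the next segment

def smooth_linear(end_value, start_value, length):
    """Linear smoothing of a parameter track across a segment border."""
    t = np.linspace(0.0, 1.0, length)
    return (1.0 - t) * end_value + t * start_value
```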

In Figure 10 the proposed approach is compared with PSOLA synthesis, implemented as described in (Moulines and Charpentier, 1990). A fragment of speech in Russian was synthesized with the two techniques using the same source acoustic database. The database segments were picked out from the speech of a female speaker. The sound sample in Figure 10(a) is the result of the PSOLA method.

Fig 10 TTS synthesis comparison: a) PSOLA synthesis; b) harmonic domain concatenation

The sound sample shown in Figure 10(b) is the result of the described analysis/synthesis approach. In order to obtain the parametric representation of the acoustic database, each segment was classified as either voiced or unvoiced. The unvoiced segments were left untouched, while the voiced segments were analysed by the technique described in Section 4; then prosody modifications and segment concatenation were carried out. Both sound samples were synthesized at 22 kHz, using the same predefined pitch contour.

As can be noticed from the presented samples, the time domain concatenation approach produces audible artefacts at segment borders. They are caused by phase and pitch mismatches that cannot be effectively avoided during synthesis. The described parametric approach provides almost inaudible phase and pitch smoothing, without distorting the spectral and formant structure of the segments. The experiments have shown that the technique works well even for short and fricative segments; however, the short Russian 'r' required special adjustment of the filter parameters at the analysis stage in order to analyse the segment properly.

The main drawback of the described approach is noise amplification immediately at segment borders, where the analysis filter gives less accurate results because of spectral leakage. In the current experiment the problem was solved by fading out the estimated noise part at the segment borders. It is also possible to pick out longer segments at the database preparation stage and then shorten them after parameterization.
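One possible implementation of this fade-out is sketched below; the linear ramp and its length are arbitrary choices made for the example, not values given in the chapter.

```python
import numpy as np

def fade_noise_borders(noise, fade_len=128):
    """Attenuate the estimated noise part near both segment borders."""
    out = noise.astype(float).copy()
    n = min(fade_len, len(out) // 2)
    ramp = np.linspace(0.0, 1.0, n)
    out[:n] *= ramp          # ramp up after the left border
    out[-n:] *= ramp[::-1]   # ramp down towards the right border
    return out
```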

7.3 Instantaneous LPC analysis of speech

LPC-based techniques are widely used for formant tracking in speech applications. By performing harmonic analysis first and then converting the parameters, a higher accuracy of formant frequency estimation can be achieved. In Figure 11 a result of voiced speech analysis is presented. The analysed signal (Figure 11(a)) is a vowel [a:] uttered by a male speaker. The sound was sampled at 8 kHz and analysed by the autocorrelation (Figure 11(b)) and the harmonic conversion (Figure 11(c)) techniques. In order to give expressive pictures, the prediction coefficients were updated for every sample of the signal in both cases. The autocorrelation analysis was carried out with an analysis frame of 512 samples, weighted by the Hamming window; the prediction order was 20 in both cases.
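Given a set of prediction coefficients, formant frequencies are commonly read off from the roots of the prediction polynomial. The routine below is a generic sketch of that standard step, not the chapter's algorithm; the rejection thresholds are illustrative.

```python
import numpy as np

def formants_from_lpc(a, fs, min_hz=90.0, max_bw_hz=400.0):
    """Formant candidates (Hz) from LPC coefficients a = [1, a1, ..., ap]."""
    roots = np.roots(a)
    roots = roots[np.imag(roots) > 0]           # one root per conjugate pair
    freqs = np.angle(roots) * fs / (2.0 * np.pi)
    bws = -np.log(np.abs(roots)) * fs / np.pi   # 3 dB bandwidth of each pole
    keep = (freqs > min_hz) & (bws < max_bw_hz)
    return np.sort(freqs[keep])
```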

8 Conclusions

An estimation technique for instantaneous sinusoidal parameters has been presented in this chapter. The technique is based on narrow-band filtering and can be applied to audio and speech signals. Signals with a harmonic structure (such as voiced speech) can be analysed using frequency-modulated filters with adjustable impulse responses. The technique performs well in the sense that accurate estimation is possible even in the case of rapid frequency modulations of the pitch. A method of pitch detection and estimation has been described as well. The use of filters with modulated impulse responses, however, requires precise estimation of the instantaneous pitch, which can be achieved through recalculation of the pitch values during the analysis process. The main disadvantage of the method is its high computational cost in comparison with the STFT.

Some experimental applications of the proposed approach have been illustrated. Sinusoidal modelling based on the presented technique has been applied to speech coding and TTS synthesis with wholly satisfactory results.

The sinusoidal model can be used for the estimation of LPC parameters that describe the instantaneous behaviour of a periodic signal. The presented technique for converting sinusoidal parameters into prediction coefficients provides high energy localization and a smaller residual for frequency-modulated signals; however, the overall performance entirely depends on the quality of the prior sinusoidal analysis. The instantaneous prediction coefficients allow fine formant tracking, which can be useful in such applications as speaker identification and speech recognition.


Future work is aimed at further investigation of the analysis filters and their behaviour, and at finding optimized solutions for the evaluation of sinusoidal parameters. There may also be some potential in adapting the described methods to other applications, such as vibration analysis of mechanical devices and diagnostics of throat diseases.

9 Acknowledgments

This work was supported by the Belarusian republican fund for fundamental research under grant T08MC-040 and by the Belarusian Ministry of Education under grant 09-3102.

10 References

Abe, T.; Kobayashi, T. & Imai, S. (1995). Harmonics tracking and pitch extraction based on instantaneous frequency, Proceedings of ICASSP 1995, pp. 756-759, 1995

Azarov, E.; Petrovsky, A. & Parfieniuk, M. (2008). Estimation of the instantaneous harmonic parameters of speech, Proceedings of the 16th European Signal Processing Conference (EUSIPCO-2008), CD-ROM, Lausanne, 2008

Boashash, B. (1992). Estimating and interpreting the instantaneous frequency of a signal, Proceedings of the IEEE, Vol. 80, No. 4, (1992) 520-568

Dutoit, T. (1997). An Introduction to Text-to-Speech Synthesis, Kluwer Academic Publishers, the Netherlands

Gabor, D. (1946). Theory of communication, Proc. IEE, Vol. 93, No. 3, (1946) 429-457

Gianfelici, F.; Biagetti, G.; Crippa, P. & Turchetti, C. (2007). Multicomponent AM-FM representations: an asymptotically exact approach, IEEE Transactions on Audio, Speech, and Language Processing, Vol. 15, No. 3, (March 2007) 823-837

Griffin, D. & Lim, J. (1988). Multiband excitation vocoder, IEEE Trans. on Acoustics, Speech and Signal Processing, Vol. 36, No. 8, (1988) 1223-1235

Hahn, S. L. (1996). Hilbert Transforms in Signal Processing, Artech House, Boston, MA

Huang, X.; Acero, A. & Hon, H. W. (2001). Spoken Language Processing, Prentice Hall, New Jersey

Levine, S. & Smith, J. (1998). A sines+transients+noise audio representation for data compression and time/pitch scale modifications, AES 105th Convention, Preprint 4781, San Francisco, CA, USA

Maragos, P.; Kaiser, J. F. & Quatieri, T. F. (1993). Energy separation in signal modulations with application to speech analysis, IEEE Trans. on Signal Processing, Vol. 41, No. 10, (1993) 3024-3051

Markel, J. D. & Gray, A. H. (1976). Linear Prediction of Speech, Springer-Verlag, Berlin Heidelberg New York

McAulay, R. J. & Quatieri, T. F. (1986). Speech analysis/synthesis based on a sinusoidal representation, IEEE Trans. on Acoustics, Speech and Signal Processing, Vol. 34, No. 4, (1986) 744-754

McAulay, R. J. & Quatieri, T. F. (1992). The sinusoidal transform coder at 2400 b/s, Proceedings of the Military Communications Conference, San Diego, Calif., USA, October 1992

Moulines, E. & Charpentier, F. (1990). Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones, Speech Communication, Vol. 9, No. 5-6, (1990) 453-467

Painter, T. & Spanias, A. (2003). Sinusoidal analysis-synthesis of audio using perceptual criteria, EURASIP Journal on Applied Signal Processing, No. 1, (2003) 15-20

Petrovsky, A.; Stankevich, A. & Balunowski, J. (1999). The order tracking front-end algorithms in the rotating machine monitoring systems based on the new digital low order tracking, Proc. of the 6th Intern. Congress "On Sound and Vibration", pp. 2985-2992, Copenhagen, Denmark, 1999

Petrovsky, A.; Azarov, E. & Petrovsky, A. (2008). Harmonic representation and auditory model-based parametric matching and its application in speech/audio analysis, AES 126th Convention, Preprint 7705, Munich, Germany

Rabiner, L. & Juang, B. H. (1993). Fundamentals of Speech Recognition, Prentice Hall, New Jersey

Serra, X. (1989). A system for sound analysis/transformation/synthesis based on a deterministic plus stochastic decomposition, Ph.D. thesis, Stanford University, Stanford, Calif., USA

Spanias, A. S. (1994). Speech coding: a tutorial review, Proc. of the IEEE, Vol. 82, No. 10, (1994) 1541-1582

Weruaga, L. & Kepesi, M. (2007). The fan-chirp transform for non-stationary harmonic signals, Signal Processing, Vol. 87, Issue 6, (June 2007) 1-18

Zhang, F.; Bi, G. & Chen, Y. Q. (2004). Harmonic transform, IEE Proc.-Vis. Image Signal Process., Vol. 151, No. 4, (August 2004) 257-264


Music Structure Analysis Statistics for Popular Songs

Namunu C. Maddage, Li Haizhou and Mohan S. Kankanhalli

School of Electrical and Computer Engineering, Royal Melbourne Institute of Technology (RMIT) University, Swanston Street, Melbourne 3000, Australia
1 Dept. of Human Language Technology, Institute for Infocomm Research, 1 Fusionopolis Way, Singapore 138632
2 School of Computing, National University of Singapore, Singapore 117417

Abstract

In this chapter, we propose a better procedure for the manual annotation of music information. The proposed annotation procedure involves carrying out listening tests and then incorporating music knowledge to iteratively refine the detected music information. Using this annotation technique, we can effectively compute the durations of the music notes, time-stamp the music regions, i.e. pure instrumental, pure vocal, instrumental mixed vocal and silence, and annotate the semantic music clusters (components in a song structure), i.e. Verse (V), Chorus (C), Bridge (B), Intro, Outro and Middle-eighth.

From the annotated information we have further derived statistics of the music structure information. We conducted experiments on 420 popular songs sung in English, Chinese, Indonesian and German. We assumed a constant tempo throughout each song and a 4/4 meter. Statistical analysis revealed that 62.46%, 35.48%, 1.87% and 0.17% of the content in a song belongs to the instrumental mixed vocal, pure instrumental, silence and pure vocal music regions respectively. We also found that over 70% of English and Indonesian songs and 30% of Chinese songs used the V-C-V-C and V-V-C-V-C song structures respectively, where V and C denote the verse and chorus. It was also found that 51% of English songs, 37% of Chinese songs, and 35% of Indonesian songs used an 8 bar duration in both the chorus and the verse.

1 Introduction

Music is a universal language people use for sharing their feelings and sensations. Thus there has been keen research interest not only in understanding how music information stimulates our minds, but also in developing applications based on music information. For example, vocal and non-vocal music information is useful for sung language recognition systems (Tsai et al., 2004; Schwenninger et al., 2006), lyrics-text and music alignment systems (Wang et al., 2004), mood classification systems (Lu & Zhang, 2006), music genre classification (Nwe & Li, 2007; Tzanetakis & Cook, 2002) and music classification systems (Xu et al., 2005; Burred & Lerch, 2004).

Also, information about rhythm, harmony, melody contours and song structures (such as repetitions of chorus and verse semantic regions) is useful for developing systems for error concealment in music streaming (Wang et al., 2003), music protection (watermarking), music summarization (Xu et al., 2005), compression, and music search.

The computer music research community has been developing algorithms to accurately extract the information in music. Many of the proposed algorithms require ground truth data for both the parameter training process and performance evaluation. For example, the performance of a music classifier which classifies the content of a music segment as vocal or non-vocal can be improved when the parameters of the classifier are trained with accurately labelled vocal and non-vocal music content in the development dataset. Also, the performance of the classifier can effectively be measured when the evaluation dataset is accurately annotated based on the exact music composition information. However, it is difficult to create accurate development and evaluation datasets, because information about the music composition is hard to find, mainly due to copyright restrictions on sharing music information in the public domain. Therefore, the current development and evaluation datasets are created by annotating the information that is extracted using subjective listening tests. Tanghe et al. (2005) discussed an annotation method for drum sounds. In Goto (2006)'s method, music scenes such as beat structure, chorus, and melody line are annotated with the help of corresponding MIDI files. Li et al. (2006) modified general audio editing software so that it becomes more convenient for identifying music semantic regions such as the chorus. The accuracy of a subjective listening test hinges on the subject's hearing competence, concentration and music knowledge. For example, it is often difficult to judge the start and end times of vocal phrases when they are presented with strong background music. If the listener's concentration is disturbed, the listening continuity is lost and it is difficult to accurately mark the phrase boundaries. However, if we know the tempo and meter of the music, then we can apply that knowledge to correct the errors in the phrase boundaries detected in the listening tests.

The speed of music information flow is directly proportional to the tempo of the music (Authors, 1949). Therefore the durations of music regions, semantic regions, inter-beat intervals, and beat positions can be measured as multiples of music notes. The music information annotation technique proposed in this chapter first locates the beat and onset positions by both listening to and visualizing the music signal using a graphical waveform editor. Since the time duration of a detected beat or onset from the start of the music is an integer multiple of the duration of the smallest note, we can estimate the duration of the smallest note. We then carry out an intensive listening exercise, with the help of the estimated duration of the smallest music note, to detect the time stamps of the music regions and the different semantic regions. Using the annotated information, we detect the song structure and calculate the statistics of the music information distributions.
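As an illustration of this last step, the snippet below computes region-duration statistics of the kind reported later in the chapter from a hypothetical annotation format (a list of labelled start/end times in seconds); the format itself is an assumption made for the example.

```python
from collections import defaultdict

def region_statistics(annotations):
    """Percentage of song duration covered by each region label.

    annotations: list of (label, start_sec, end_sec) tuples, e.g.
    [("PI", 0.0, 11.8), ("IMV", 11.8, 40.2), ("S", 40.2, 40.9)].
    """
    totals = defaultdict(float)
    for label, start, end in annotations:
        totals[label] += end - start
    song_length = sum(totals.values())
    return {label: 100.0 * dur / song_length for label, dur in totals.items()}
```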

This chapter is organized as follows. Popular music structure is discussed in section 2 and effective information annotation procedures are explained in section 3. Section 4 details the statistics of the music information. We conclude the chapter in section 5 with a discussion.

2 Music Structure

As shown in Fig 1, the underlying music information can conceptually be represented as layers in a pyramid (Maddage, 2005). These information layers are:

1) The first layer represents the time information (beats, tempo, and meter);
2) The second layer represents the harmony/melody, which is formed by playing musical notes simultaneously;
3) The third layer describes the music regions, i.e. pure vocal (PV), pure instrumental (PI), instrumental mixed vocal (IMV) and silence (S);
4) The fourth layer and above represent the semantics of the popular song.

Fig 1 Information grouping in the music structure pyramid

The pyramid diagram represents music semantics which influence our imaginations. Jourdain (1997) also discussed how sound, tone, melody, harmony, composition, performance, listening, understanding and ecstasy lead to our imagination. Time information describes the rate of information flow in music. The durations of harmony/melody contours and phrases, which create the music regions, are proportional to the tempo of the music. Melody is created when a single note is played at a time, while playing multiple notes results in a harmony sound. Psychological studies have suggested that the human cognitive mechanism can effectively distinguish the tones of the diatonic scale (Burred & Lerch, 2004). Scale changes, or modulation of the scale in a different section of the song, can effectively be noticed in listening tests. Therefore, in our listening tests we detect the Middle-eighth regions (see next section), which have a different Key from the main Key of the song.


The rhythm of words can be tailored to fit into a music phrase (Authors, 1949). The vocal regions in music comprise words and syllables, which are uttered according to a time signature. Fig 2 shows how the words "Little Jack Horner sat in the Corner" are turned into a rhythm, and the music notation of those words. The important words or syllables in the sentence fall onto accents to form the rhythm of the music. Typically, these words are placed at the first beat of a bar. When the TS is set to two Crotchet beats per bar, we see that the duration of the word "Little" is equal to two Quaver notes and the duration of the word "Jack" is equal to a Crotchet note.

Fig 2 Rhythmic flow of words

Popular song structure often contains Intro, Verse, Chorus, Bridge, Middle-eighth, instrumental sections (INST) and Outro (Authors, 2003). As shown in Fig 1, these parts are built on melody-based similarity regions and content-based similarity regions. Melody-based similarity regions are defined as the regions which have similar pitch contours constructed from the chord patterns. Content-based similarity regions are defined as the regions which have both similar vocal content and similar melody. In terms of music structure, the Chorus sections and Verse sections in a song are considered the content-based similarity regions and melody-based similarity regions respectively. They can be grouped to form semantic clusters as in Fig 3. For example, all the Chorus regions in a song form a Chorus cluster, while all the Verse regions form a Verse cluster, and so on.

Fig 3 Semantic similarity clusters which define the structure of the popular song

A song may have an Intro of 2, 4, 8 or 16 bars, or none at all. The Intro usually consists of instrumental music. Both the Verse and the Chorus are 8 or 16 bars long. Typically, the Verse is not melodically as strong as the Chorus. However, in some songs both the verse and the chorus are equally melodically strong, and most people can hum or sing both. A Bridge links the gap between the Verse and the Chorus, and may have only two or four bars.



Silence may also act as a Bridge between the Verse and Chorus of a song, but such cases are rare. The Middle-eighth, which is 4, 8 or 16 bars in length, is an alternative version of a Verse with a new chord progression, possibly modulated to a different key. Many people use the terms "Middle-eighth" and "Bridge" synonymously. However, the main difference is that the Middle-eighth is longer (usually 16 bars) than the Bridge and usually appears after the third verse of the song. There are also instrumental sections in the song; they can be instrumental versions of the Chorus or Verse, or entirely different tunes with their own set of chords. Typically INST regions have 8 or 16 bars. The Outro, which is the ending of the song, is usually a fade-out of the last phrases of the chorus.

We have described the parts of the song, which are commonly arranged according to the simple verse-chorus-and-repeat pattern. Two variations on this theme are as follows:

(a) Intro, Verse 1, Verse 2, Chorus, Verse 3, Middle-eighth, Chorus, Chorus, Outro
(b) Intro, Verse 1, Chorus, Verse 2, Chorus, Chorus, Outro

Fig 4 illustrates two examples of these patterns. The song "25 Minutes" by MLTR follows pattern (a) and "Can't Let You Go" by Mariah Carey follows pattern (b). For a better understanding of how artists have combined these parts to compose a song, we conducted a survey of popular Chinese and English songs. Details of the survey are discussed in the next section.

Fig 4 Two examples for verse-chorus pattern repetitions

3 Music Structure Information Annotation

The fundamental step in audio content analysis is signal segmentation. Within a segment, the information can be considered quasi-stationary. Feature extraction and information modeling, followed by music segmentation, are the essential steps for music structure analysis. Determining the segment size that is suitable for extracting a certain level of information requires a better understanding of the rate of information flow in the audio data. Over three decades of speech processing research has revealed that 20-40 ms fixed length signal segmentation is appropriate for speech content analysis (Rabiner & Juang, 2005). The composition of a music piece reveals that the rate of information flow, such as notes, chords, key and vocal phrases, is proportional to the inter-beat intervals.

Fig 5 shows the quarter, eighth and sixteenth note boundaries in a song clip. It can be seen that the fluctuations of the signal properties in both the spectral and time domains are aligned with those note boundaries. Usually smaller notes, such as eighth, sixteenth and thirty-second notes or smaller, are played in the bars to align the harmony contours with the rhythm flow of the lyrics and to fill the gaps between lyrics (Authors, 1949). Therefore inter-beat proportional music segmentation, instead of fixed length segmentation, has recently been proposed for music content analysis (Maddage, 2004; Maddage, 2005; Wang, 2004).
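The note-proportional segment sizes implied by this scheme follow directly from the tempo. The helper below is a generic calculation under the chapter's 4/4 assumption, not code from the chapter; for example, the 736.28 ms quarter note of "25 Minutes" corresponds to a tempo of about 60000 / 736.28 ≈ 81.5 BPM, an eighth-note segment of about 368 ms and a bar of about 2.95 s.

```python
def note_durations_ms(tempo_bpm, beats_per_bar=4):
    """Durations (ms) of common note values, assuming the beat is a quarter
    note (4/4 time signature)."""
    quarter = 60000.0 / tempo_bpm
    return {
        "bar": beats_per_bar * quarter,
        "quarter": quarter,
        "eighth": quarter / 2.0,
        "sixteenth": quarter / 4.0,
        "thirty-second": quarter / 8.0,
    }

# Example: the clip in Fig 5 has a 736.28 ms quarter note.
print(note_durations_ms(60000.0 / 736.28))
```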

Fig 5 Spectral and time domain visualization of a (0~3657) ms song clip from "25 Minutes" by MLTR. The quarter note length is 736.28 ms and the note boundaries are highlighted using dotted lines

3.1 Computation of Inter-beat interval

Once the staff of a song is available, the duration of the beat can be calculated from the tempo and the time signature. However, commercially available music albums (CDs) do not provide the staff information of the songs. Therefore subjects with a good knowledge of the theory and practice of music have to closely examine the songs to estimate the inter-beat intervals. We assume all the songs have a 4/4 time signature, which is the most commonly used TS in popular songs (Goto, 2001; Authors, 2003). Following the results on music composition and structure discussed in section 2, we only allow the beat positions to fall at integer multiples of the smallest note from the start point of the song. The estimation of both the inter-beat interval and the song tempo using iterative listening is explained below, with Fig 6 as an example.

- Play the song in audio editing software which has a GUI to visualize the time domain signal with high resolution. While listening to the music, a steady throb to which one can clap is noticed; the duration between consecutive claps is called the inter-beat interval. As we assume a 4/4 time signature, the inter-beat interval is of quarter note length, and four quarter notes form a bar.

- As shown in Fig 6, the positions of both beats and note onsets can be effectively visualized on the GUI; the j-th position is indicated as Pj. By replaying the song and zooming into the areas of neighbouring beats and onset positions, we can estimate the inter-beat interval and therefore the duration of the note Xj. In Fig 6, the duration X1 is the first estimated eighth note.

Fig 6 Estimation of music note

- After the first estimation, we establish a 4-step iterative listening process, described below, to reduce the error between the estimates and the desired music note lengths. The constraint we apply is that a beat position is equal to an integer multiple of frames. To start with, the first frame size is set to the estimated note, i.e. frame size = X1 in Fig 6.

Step 1: Set the currently estimated note length as the frame size and calculate the number of frames NFj at an identified beat or onset position (for the initialization we set j = 1):

NFj = Pj / Xj (1)

Step 2: As the resulting NFj is typically a floating point value, we measure the difference between the rounded NFj and NFj, referred to as DNF.

Step 3: At beat/onset position Pj, we calculate the new note length Xj+1 as

Xj+1 = Pj / NFj (4)

where NFj is the rounded (integer) number of frames from Step 2 (using the unrounded value would simply return Xj).

Step 4: Iterate steps 1 to 3 at beat or onset positions towards the end of the song. When these iterative steps are carried out over many of the beat and onset positions towards the end of the song, the errors in the estimated note length are minimized. Based on the final estimate of the note length, we can calculate the quarter note length.

Fig 7 shows the variation of the estimated quarter note length for two songs. The beat/onset positions nearly divide the song into equal intervals. Beat/onset point zero ("0") represents the first estimation of the quarter note length. The correct tempos of the songs "You Are Still the One" and "The Woman in Me" (both by Shania Twain) are 67 BPM and 60 BPM respectively.

Fig 7 Variation of the estimated length of the quarter note at beat/onset points when the listening test was carried out till the end of the song

It can be seen in Fig 7 that the deviation of the estimated quarter note length is high at the beginning of the song. However, the estimate converges to the correct value in the second half of the song. The reason for the fluctuation of the estimated note length is explained below.

As shown in Fig 6, the first estimation of the note length (X1) is done using only the audio-visual editing software. Thus the first estimate (beat/onset point 0 in Fig 7) can have very high variation due to the difficulty of judging the correct boundaries of the notes. As the song proceeds, we use Eq. (1), (2), (3) and (4) to iteratively estimate the duration of the note at the corresponding beat/onset points. Since beat/onset points near the start of the song have shorter durations Pj, the initial iterative estimates of the note length have higher variation. For example, in Fig 6 beat/onset point P1 is close to the start of the song; from Eq. (1) and the first estimate of the note length X1 we compute NF1. Eq. (2) and (3) are useful in limiting the error in the computed NF to under one frame. When X1 is inaccurate and P1 is short, the errors in the computed number of frames NF1 in Eq. (1) have a larger effect on the next estimated note length in Eq. (4). However, for distant beat/onset points, i.e. when Pj is longer and NFj is larger and more accurate, the estimated note lengths tend to converge.
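The four steps can be expressed compactly in code. The sketch below reflects our reading of the procedure (in particular, the rounded frame count is assumed to be the one used in Eq. (4)) and uses hypothetical onset positions; it is not the authors' implementation.

```python
def refine_note_length(x1_ms, onset_positions_ms):
    """Iteratively refine the smallest-note length from beat/onset positions.

    x1_ms: first visual/listening estimate of the note length X1 (ms).
    onset_positions_ms: positions P1, P2, ... measured from the start of the
    song (ms), assumed to lie at integer multiples of the smallest note.
    """
    x = x1_ms
    for p in onset_positions_ms:
        nf = p / x                  # Step 1: number of frames, Eq. (1)
        nf_rounded = round(nf)      # Step 2: nearest integer frame count
        if nf_rounded == 0:
            continue                # onset too close to the start to be useful
        x = p / nf_rounded          # Step 3: updated note length, Eq. (4)
    return x                        # Step 4: final estimate after all onsets

# Example: a true eighth note of 368.14 ms and a rough initial guess of 380 ms;
# later (more distant) onsets pull the estimate towards the correct value.
print(refine_note_length(380.0, [1104.4, 2945.1, 11780.5, 58902.4]))
```

As in Fig 7, the early updates fluctuate because the positions near the start of the song are short, while the later, more distant positions dominate the final value.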

