The flexibility of the MPEG-7 SpokenContent description makes it usable in many different application contexts. The main possible types of applications are:
• Spoken document retrieval. This is the most obvious application of spoken content metadata, already detailed in this chapter. The goal is to retrieve information in a database of spoken documents. The result of the query may be the top-ranked relevant documents. As SpokenContent descriptions include the time locations of recognition hypotheses, the position of the retrieved query word(s) in the most relevant documents may also be returned to the user (a minimal sketch follows this list). Mixed SpokenContent lattices (i.e. combining words and phones) could be an efficient approach in most cases.
• Indexing of audiovisual data. The spoken segments in the audio stream can be annotated with SpokenContent descriptions (e.g. word lattices yielded by an LVCSR system). A preliminary segmentation of the audio stream is necessary to spot the spoken parts. The spoken content metadata can then be used to search for particular events in a film or a video (e.g. the occurrence of a query word or sequence of words in the audio stream).
• Spoken annotation of databases. Each item in a database is annotated with a short spoken description. This annotation is processed by an ASR system and attached to the item as a SpokenContent description. This metadata can then be used to search for items in the database, by processing the SpokenContent annotations with an SDR engine. A typical example of such applications, already on the market, is the spoken annotation of photographs. In that case, speech decoding is performed on a mobile device (integrated in the camera itself) with limited storage and computational capacities. The use of a simple phone recognizer may be appropriate.
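As a minimal sketch of the first application above (an assumed in-memory structure, not the MPEG-7 SpokenContent syntax), the index below maps each hypothesised word to the documents and time offsets where it was recognised, so that a query returns both a document ranking and the positions of the matching terms:

from collections import defaultdict

# word -> list of (document id, start time in seconds), taken from the
# recognition hypotheses of a (hypothetical) SpokenContent lattice
index = defaultdict(list)

def add_hypothesis(doc_id: str, word: str, start_time: float) -> None:
    index[word.lower()].append((doc_id, start_time))

def search(query: str):
    """Rank documents by the number of matching query terms and return
    the time locations of the matches in each document."""
    hits = defaultdict(list)                 # doc_id -> [(word, time), ...]
    for word in query.lower().split():
        for doc_id, t in index.get(word, []):
            hits[doc_id].append((word, t))
    return sorted(hits.items(), key=lambda kv: len(kv[1]), reverse=True)

add_hypothesis("doc1", "spoken", 11.9)
add_hypothesis("doc1", "retrieval", 12.4)
add_hypothesis("doc2", "spoken", 3.1)
print(search("spoken retrieval"))   # doc1 first, with matching time offsets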
4.5.3 Perspectives
One of the most promising perspectives for the development of efficient spoken content retrieval methods is the combination of multiple independent index sources. A SpokenContent description can represent the same spoken information at different levels of granularity in the same lattice by merging words and sub-lexical terms.

These multi-level descriptions lead to retrieval approaches that combine the discriminative power of large-vocabulary word-based indexing with the open-vocabulary property of sub-word-based indexing, by which the problem of OOV words is greatly alleviated. As outlined in Section 4.4.6.2, some steps have already been made in this direction. However, hybrid word/sub-word-based SDR strategies have to be further investigated, with new fusion methods (Yu and Seide, 2004) or new combinations of index sources, e.g. combined use of distinct types of sub-lexical units (Lee et al., 2004) or distinct LVCSR systems (Matsushita et al., 2004).
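As an illustration only (a simple linear score fusion under assumed, comparable score scales, not the method of Yu and Seide, 2004), a hybrid word/sub-word retrieval run can be sketched as a weighted combination of the relevance scores produced by a word-based index and a phone-based index:

def fuse_scores(word_scores: dict, phone_scores: dict, alpha: float = 0.7) -> dict:
    """Linear fusion of two retrieval runs; both inputs map doc id -> score.

    alpha weights the word-based run; (1 - alpha) weights the phone-based run,
    which still contributes for OOV query terms that the word index misses.
    Scores are assumed to be already normalised to a common range, e.g. [0, 1].
    """
    docs = set(word_scores) | set(phone_scores)
    return {d: alpha * word_scores.get(d, 0.0)
               + (1 - alpha) * phone_scores.get(d, 0.0)
            for d in docs}

word_run = {"doc1": 0.9, "doc2": 0.2}    # scores from the word (LVCSR) index
phone_run = {"doc2": 0.8, "doc3": 0.6}   # scores from the phone n-gram index
ranking = sorted(fuse_scores(word_run, phone_run).items(),
                 key=lambda kv: kv[1], reverse=True)

Other fusion schemes (e.g. rank-based merging) fit the same skeleton; only the combination rule changes.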
Another important perspective is the combination of spoken content with other metadata derived from speech (Begeja et al., 2004; Hu et al., 2004). In general, the information contained in a spoken message consists of more than just words. In the query, users could be given the possibility to search for words, phrases, speakers, words and speakers together, non-verbal speech characteristics (male/female), non-speech events (like coughing or other human noises), etc. In particular, the speakers’ identities may be of great interest for retrieving information in audio. If a speaker segmentation and identification algorithm is applied to annotate the lattices with speaker identifiers (stored in SpeakerInfo metadata), this can help in searching for particular events in a film or a video (e.g. sentences or words spoken by a given character in a film). The SpokenContent descriptions enclose other types of valuable indexing information, such as the spoken language.
REFERENCES
Angelini B., Falavigna D., Omologo M. and De Mori R. (1998) “Basic Speech Sounds, their Analysis and Features”, in Spoken Dialogues with Computers, pp. 69–121, R. De Mori (ed.), Academic Press, London.
Begeja L., Renger B., Saraclar M., Gibbon D., Liu Z. and Shahraray B. (2004) “A System for Searching and Browsing Spoken Communications”, HLT-NAACL 2004 Workshop on Interdisciplinary Approaches to Speech Indexing and Retrieval, pp. 1–8, Boston, MA, USA, May.
Browne P., Czirjek C., Gurrin C., Jarina R., Lee H., Marlow S., McDonald K., Murphy N., O’Connor N. E., Smeaton A. F. and Ye J. (2002) “Dublin City University Video Track Experiments for TREC 2002”, NIST, 11th Text Retrieval Conference (TREC 2002), Gaithersburg, MD, USA, November.
Buckley C. (1985) “Implementation of the SMART Information Retrieval System”, Computer Science Department, Cornell University, Report 85–686.
Chomsky N. and Halle M. (1968) The Sound Pattern of English, MIT Press, Cambridge, MA.
Clements M., Cardillo P. S. and Miller M. S. (2001) “Phonetic Searching vs. LVCSR: How to Find What You Really Want in Audio Archives”, AVIOS 2001, San Jose, CA, USA, April.
Coden A. R., Brown E. and Srinivasan S. (2001) “Information Retrieval Techniques for Speech Applications”, ACM SIGIR 2001 Workshop “Information Retrieval Techniques for Speech Applications”.
Crestani F. (1999) “A Model for Combining Semantic and Phonetic Term Similarity for Spoken Document and Spoken Query Retrieval”, International Computer Science Institute, Berkeley, CA, tr-99-020, December.
Crestani F. (2002) “Using Semantic and Phonetic Term Similarity for Spoken Document Retrieval and Spoken Query Processing”, in Technologies for Constructing Intelligent Systems, pp. 363–376, J. G.-R., B. Bouchon-Meunier and R. R. Yager (eds), Springer-Verlag, Heidelberg, Germany.
Crestani F., Lalmas M., van Rijsbergen C. J. and Campbell I. (1998) “‘Is This Document Relevant? Probably’: A Survey of Probabilistic Models in Information Retrieval”, ACM Computing Surveys, vol. 30, no. 4, pp. 528–552.
Deligne S. and Bimbot F. (1995) “Language Modelling by Variable Length Sequences: Theoretical Formulation and Evaluation of Multigrams”, ICASSP’95, pp. 169–172, Detroit, USA.
Ferrieux A. and Peillon S. (1999) “Phoneme-Level Indexing for Fast and Vocabulary-Independent Voice/Voice Retrieval”, ESCA Tutorial and Research Workshop (ETRW), “Accessing Information in Spoken Audio”, Cambridge, UK, April.
Gauvain J.-L., Lamel L., Barras C., Adda G. and de Kercardio Y. (2000) “The LIMSI SDR System for TREC-9”, NIST, 9th Text Retrieval Conference (TREC 9), pp. 335–341, Gaithersburg, MD, USA, November.
Glass J. and Zue V. W. (1988) “Multi-Level Acoustic Segmentation of Continuous Speech”, ICASSP’88, pp. 429–432, New York, USA, April.
Glass J., Chang J. and McCandless M. (1996) “A Probabilistic Framework for Feature-based Speech Recognition”, ICSLP’96, vol. 4, pp. 2277–2280, Philadelphia, PA, USA, October.
Glavitsch U. and Schäuble P. (1992) “A System for Retrieving Speech Documents”, ACM SIGIR, pp. 168–176.
Gold B. and Morgan N. (1999) Speech and Audio Signal Processing, John Wiley & Sons, Inc., New York.
Halberstadt A. K. (1998) “Heterogeneous acoustic measurements and multiple classifiers for speech recognition”, PhD Thesis, Massachusetts Institute of Technology (MIT), Cambridge, MA.
Hartigan J. (1975) Clustering Algorithms, John Wiley & Sons, Inc., New York.
Hu Q., Goodman F., Boykin S., Fish R. and Greiff W. (2004) “Audio Hot Spotting and Retrieval using Multiple Features”, HLT-NAACL 2004 Workshop on Interdisciplinary Approaches to Speech Indexing and Retrieval, pp. 13–17, Boston, MA, USA, May.
James D. A. (1995) “The Application of Classical Information Retrieval Techniques to Spoken Documents”, PhD Thesis, University of Cambridge, Speech, Vision and Robotics Group, Cambridge, UK.
Jelinek F. (1998) Statistical Methods for Speech Recognition, MIT Press, Cambridge, MA.
Johnson S. E., Jourlin P., Spärck Jones K. and Woodland P. C. (2000) “Spoken Document Retrieval for TREC-9 at Cambridge University”, NIST, 9th Text Retrieval Conference (TREC 9), pp. 117–126, Gaithersburg, MD, USA, November.
Jones G. J. F., Foote J. T., Spärck Jones K. and Young S. J. (1996) “Retrieving Spoken Documents by Combining Multiple Index Sources”, ACM SIGIR’96, pp. 30–38, Zurich, Switzerland, August.
Katz S. M. (1987) “Estimation of Probabilities from Sparse Data for the Language Model Component of a Speech Recognizer”, IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 35, no. 3, pp. 400–401.
Kupiec J., Kimber D. and Balasubramanian V. (1994) “Speech-based Retrieval using Semantic Co-Occurrence Filtering”, ARPA Human Language Technologies (HLT) Conference, pp. 373–377, Plainsboro, NJ, USA.
Larson M. and Eickeler S. (2003) “Using Syllable-based Indexing Features and Language Models to Improve German Spoken Document Retrieval”, ISCA, Eurospeech 2003, pp. 1217–1220, Geneva, Switzerland, September.
Lee S. W., Tanaka K. and Itoh Y. (2004) “Multi-layer Subword Units for Open-Vocabulary Spoken Document Retrieval”, ICSLP’2004, Jeju Island, Korea, October.
Levenshtein V. I. (1966) “Binary Codes Capable of Correcting Deletions, Insertions and Reversals”, Soviet Physics Doklady, vol. 10, no. 8, pp. 707–710.
Lindsay A. T., Srinivasan S., Charlesworth J. P. A., Garner P. N. and Kriechbaum W. (2000) “Representation and linking mechanisms for audio in MPEG-7”, Signal Processing: Image Communication Journal, Special Issue on MPEG-7, vol. 16, pp. 193–209.
Logan B., Moreno P. J. and Deshmukh O. (2002) “Word and Sub-word Indexing Approaches for Reducing the Effects of OOV Queries on Spoken Audio”, Human Language Technology Conference (HLT 2002), San Diego, CA, USA, March.
Matsushita M., Nishizaki H., Nakagawa S. and Utsuro T. (2004) “Keyword Recognition and Extraction by Multiple-LVCSRs with 60,000 Words in Speech-driven WEB Retrieval Task”, ICSLP’2004, Jeju Island, Korea, October.
Moreau N., Kim H.-G. and Sikora T. (2004a) “Combination of Phone N-Grams for a MPEG-7-based Spoken Document Retrieval System”, EUSIPCO 2004, Vienna, Austria, September.
Moreau N., Kim H.-G. and Sikora T. (2004b) “Phone-based Spoken Document Retrieval in Conformance with the MPEG-7 Standard”, 25th International AES Conference “Metadata for Audio”, London, UK, June.
Moreau N., Kim H.-G. and Sikora T. (2004c) “Phonetic Confusion Based Document Expansion for Spoken Document Retrieval”, ICSLP Interspeech 2004, Jeju Island, Korea, October.
Morris R. W., Arrowood J. A., Cardillo P. S. and Clements M. A. (2004) “Scoring Algorithms for Wordspotting Systems”, HLT-NAACL 2004 Workshop on Interdisciplinary Approaches to Speech Indexing and Retrieval, pp. 18–21, Boston, MA, USA, May.
Ng C., Wilkinson R. and Zobel J. (2000) “Experiments in Spoken Document Retrieval Using Phoneme N-grams”, Speech Communication, vol. 32, no. 1, pp. 61–77.
Ng K. (1998) “Towards Robust Methods for Spoken Document Retrieval”, ICSLP’98, vol. 3, pp. 939–942, Sydney, Australia, November.
Ng K. (2000) “Subword-based Approaches for Spoken Document Retrieval”, PhD Thesis, Massachusetts Institute of Technology (MIT), Cambridge, MA.
Ng K. and Zue V. (1998) “Phonetic Recognition for Spoken Document Retrieval”, ICASSP’98, pp. 325–328, Seattle, WA, USA.
Ng K. and Zue V. W. (2000) “Subword-based Approaches for Spoken Document Retrieval”, Speech Communication, vol. 32, no. 3, pp. 157–186.
Paul D. B. (1992) “An Efficient A∗ Stack Decoder Algorithm for Continuous Speech Recognition with a Stochastic Language Model”, ICASSP’92, pp. 25–28, San Francisco, USA.
Porter M. (1980) “An Algorithm for Suffix Stripping”, Program, vol. 14, no. 3, pp. 130–137.
Rabiner L. (1989) “A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition”, Proceedings of the IEEE, vol. 77, no. 2, pp. 257–286.
Rabiner L. and Juang B.-H. (1993) Fundamentals of Speech Recognition, Prentice Hall, Englewood Cliffs, NJ.
Robertson E. S. (1977) “The probability ranking principle in IR”, Journal of Documentation, vol. 33, no. 4, pp. 294–304.
Rose R. C. (1995) “Keyword Detection in Conversational Speech Utterances Using Hidden Markov Model Based Continuous Speech Recognition”, Computer, Speech and Language, vol. 9, no. 4, pp. 309–333.
Salton G. and Buckley C. (1988) “Term-Weighting Approaches in Automatic Text Retrieval”, Information Processing and Management, vol. 24, no. 5, pp. 513–523.
Salton G. and McGill M. J. (1983) Introduction to Modern Information Retrieval, McGraw-Hill, New York.
Srinivasan S. and Petkovic D. (2000) “Phonetic Confusion Matrix Based Spoken Document Retrieval”, 23rd Annual ACM Conference on Research and Development in Information Retrieval (SIGIR’00), pp. 81–87, Athens, Greece, July.
TREC (2001) “Common Evaluation Measures”, NIST, 10th Text Retrieval Conference (TREC 2001), pp. A–14, Gaithersburg, MD, USA, November.
van Rijsbergen C. J. (1979) Information Retrieval, Butterworths, London.
Voorhees E. and Harman D. K. (1998) “Overview of the Seventh Text REtrieval Conference”, NIST, 7th Text Retrieval Conference (TREC-7), pp. 1–24, Gaithersburg, MD, USA, November.
Walker S., Robertson S. E., Boughanem M., Jones G. J. F. and Spärck Jones K. (1997) “Okapi at TREC-6 Automatic Ad Hoc, VLC, Routing, Filtering and QSDR”, 6th Text Retrieval Conference (TREC-6), pp. 125–136, Gaithersburg, MD, USA, November.
Wechsler M. (1998) “Spoken Document Retrieval Based on Phoneme Recognition”, PhD Thesis, Swiss Federal Institute of Technology (ETH), Zurich.
Wechsler M., Munteanu E. and Schäuble P. (1998) “New Techniques for Open-Vocabulary Spoken Document Retrieval”, 21st Annual ACM Conference on Research and Development in Information Retrieval (SIGIR’98), pp. 20–27, Melbourne, Australia, August.
Wells J. C. (1997) “SAMPA computer readable phonetic alphabet”, in Handbook of Standards and Resources for Spoken Language Systems, D. Gibbon, R. Moore and R. Winski (eds), Mouton de Gruyter, Berlin and New York.
Wilpon J. G., Rabiner L. R. and Lee C.-H. (1990) “Automatic Recognition of Keywords in Unconstrained Speech Using Hidden Markov Models”, IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 38, no. 11, pp. 1870–1878.
Witbrock M. and Hauptmann A. G. (1997) “Speech Recognition and Information Retrieval: Experiments in Retrieving Spoken Documents”, DARPA Speech Recognition Workshop, Chantilly, VA, USA, February.
Yu P. and Seide F. T. B. (2004) “A Hybrid Word/Phoneme-Based Approach for Improved Vocabulary-Independent Search in Spontaneous Speech”, ICSLP’2004, Jeju Island, Korea, October.
5 Music Description Tools
The purpose of this chapter is to outline how music and musical signals can be described. Several MPEG-7 high-level tools were designed to describe the properties of musical signals. Our prime goal is to use these descriptors to compare music signals and to query for pieces of music.

The aim of the MPEG-7 Timbre DS is to describe some perceptual features of musical sounds with a reduced set of descriptors. These descriptors relate to notions such as “attack”, “brightness” or “richness” of a sound. The Melody DS is a representation for melodic information which mainly aims at facilitating efficient melodic similarity matching. The musical Tempo DS is defined to characterize the underlying temporal structure of musical sounds. In this chapter we focus exclusively on MPEG-7 tools and applications. We outline how distance measures can be constructed that allow queries for music based on the MPEG-7 DS.

5.1 TIMBRE
5.1.1 Introduction
In music, timbre is the quality of a musical note which distinguishes different types of musical instrument, see (Wikipedia, 2001). The timbre is like a formant in speech; a certain timbre is typical for a musical instrument. This is why, with a little practice, it is possible for human beings to distinguish a saxophone from a trumpet in a jazz group or a flute from a violin in an orchestra, even if they are playing notes at the same pitch and amplitude. Timbre has been called the psycho-acoustician’s waste-basket as it can include so many factors.

Though the phrase tone colour is often used as a synonym for timbre, colours of the optical spectrum are not generally explicitly associated with particular sounds. Rather, the sound of an instrument may be described with words like “warm” or
“harsh” or other terms, perhaps suggesting that tone colour has more in common with the sense of touch than of sight. People who experience synaesthesia, however, may see certain colours when they hear particular instruments. Two sounds with similar physical characteristics like pitch and loudness may have different timbres. The aim of the MPEG-7 Timbre DS is to describe perceptual features with a reduced set of descriptors.
MPEG-7 distinguishes four different families of sounds:
• Harmonic sounds
• Inharmonic sounds
• Percussive sounds
• Non-coherent sounds
These families are characterized using the following features of sounds:
• Harmony: related to the periodicity of a signal, distinguishes harmonic from inharmonic and noisy signals.
• Sustain: related to the duration of excitation of the sound source, distinguishes sustained from impulsive signals.
• Coherence: related to the temporal behaviour of the signal’s spectral components, distinguishes spectra with prominent components from noisy spectra.

The four sound families correspond to these characteristics, see Table 5.1.

Table 5.1 Sound families and sound characteristics (from ISO, 2001a)

Sound family     Harmonic                  Inharmonic             Percussive                  Non-coherent
Characteristics  Sustained, harmonic       Sustained, inharmonic  Impulsive                   Sustained
Example          Violin, flute             Bell, triangle         Snare, claves               Cymbals
Timbre DS        HarmonicInstrumentTimbre                         PercussiveInstrumentTimbre

Possible target applications are, following the standard (ISO, 2001a):

• Authoring tools for sound designers or musicians (music sample database management). Consider a musician using a sample player for music production, playing the drum sounds in his or her musical recordings. Large libraries of sound files for use with sample players are already available. The MPEG-7 Timbre DS could be used to find the percussive sounds in such a library which best match the musician’s idea for his or her production.
• Retrieval tools for producers (query-by-example (QBE) search based on perceptual features). If a producer wants a certain type of sound and already has a sample sound, the MPEG-7 Timbre DS provides the means to find the most similar sound in a sound file of a music database. Note that this problem is often referred to as audio fingerprinting.
All descriptors of the MPEG-7 Timbre DS use the low-level timbral descriptors already defined in Chapter 2 of this book. The following sections describe the high-level DS InstrumentTimbre, HarmonicInstrumentTimbre and PercussiveInstrumentTimbre.

5.1.2 InstrumentTimbre
The structure of the InstrumentTimbre is depicted in Figure 5.1. It is a set of timbre descriptors in order to describe timbres with harmonic and percussive aspects:
• LogAttackTime (LAT), the LogAttackTime descriptor, see Section 2.7.2.
• HarmonicSpectralCentroid (HSC), the HarmonicSpectralCentroid descriptor, see Section 2.7.5.
• HarmonicSpectralDeviation (HSD), the HarmonicSpectralDeviation descriptor, see Section 2.7.6.
• HarmonicSpectralSpread (HSS), the HarmonicSpectralSpread descriptor, see Section 2.7.7.
Figure 5.1 The InstrumentTimbre: + signs at the end of a field indicate further structured content; – signs mean unfolded content; · · · indicates a sequence (from Manjunath et al., 2002)
• HarmonicSpectralVariation (HSV), the HarmonicSpectralVariation descriptor, see Section 2.7.8.
• SpectralCentroid (SC), the SpectralCentroid descriptor, see Section 2.7.9.
• TemporalCentroid (TC), the TemporalCentroid descriptor, see Section 2.7.3.
Together, these descriptors combine harmonic and percussive features. An InstrumentTimbre description of a particular sound, e.g. a harp, holds one value per descriptor and is written in MPEG-7 XML syntax; a rough sketch of such a description follows.
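The normative XML listing for such a harp description is not reproduced here. As a rough, non-normative sketch of the information it carries, the following record holds one field per descriptor; the numeric values are purely illustrative and not taken from the standard:

from dataclasses import dataclass

@dataclass
class InstrumentTimbre:
    """One field per MPEG-7 InstrumentTimbre descriptor."""
    log_attack_time: float              # LAT
    harmonic_spectral_centroid: float   # HSC
    harmonic_spectral_deviation: float  # HSD
    harmonic_spectral_spread: float     # HSS
    harmonic_spectral_variation: float  # HSV
    spectral_centroid: float            # SC
    temporal_centroid: float            # TC

# Illustrative values only (not measured, not taken from the standard or this book).
harp = InstrumentTimbre(
    log_attack_time=-2.1,
    harmonic_spectral_centroid=1350.0,
    harmonic_spectral_deviation=0.05,
    harmonic_spectral_spread=0.4,
    harmonic_spectral_variation=0.02,
    spectral_centroid=1400.0,
    temporal_centroid=0.3,
)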
5.1.3 HarmonicInstrumentTimbre

Figure 5.2 shows the HarmonicInstrumentTimbre. It holds the following set of timbre descriptors to describe the timbre perception among sounds belonging to the harmonic sound family, see (ISO, 2001a). A small sketch of a distance measure built on these descriptors follows the list:
• LogAttackTime (LAT), the LogAttackTime descriptor, see Section 2.7.2.
• HarmonicSpectralCentroid (HSC), the HarmonicSpectralCentroid descriptor, see Section 2.7.5.
Figure 5.2 The HarmonicInstrumentTimbre (from Manjunath et al., 2002)
• HarmonicSpectralDeviation (HSD), the HarmonicSpectralDeviation descriptor, see Section 2.7.6.
• HarmonicSpectralSpread (HSS), the HarmonicSpectralSpread descriptor, see Section 2.7.7.
• HarmonicSpectralVariation (HSV), the HarmonicSpectralVariation descriptor, see Section 2.7.8.
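A small sketch shows how a distance measure over these descriptors can support query-by-example retrieval. The min–max normalisation ranges and the unit weights below are assumptions for illustration, not the normative MPEG-7 harmonic timbre distance:

import math

FEATURES = ("LAT", "HSC", "HSD", "HSS", "HSV")

# Placeholder weights and value ranges; in a real system they would be tuned
# or derived from perceptual experiments.
WEIGHTS = {"LAT": 1.0, "HSC": 1.0, "HSD": 1.0, "HSS": 1.0, "HSV": 1.0}
RANGES = {"LAT": (-4.0, 0.0), "HSC": (0.0, 5000.0), "HSD": (0.0, 1.0),
          "HSS": (0.0, 1.0), "HSV": (0.0, 1.0)}

def timbre_distance(a: dict, b: dict) -> float:
    """Weighted Euclidean distance between two harmonic timbre descriptions."""
    total = 0.0
    for f in FEATURES:
        lo, hi = RANGES[f]
        da = (a[f] - lo) / (hi - lo)    # scale each descriptor to [0, 1]
        db = (b[f] - lo) / (hi - lo)
        total += WEIGHTS[f] * (da - db) ** 2
    return math.sqrt(total)

def rank_by_similarity(query: dict, database: dict) -> list:
    """Query by example: database item names ordered by increasing distance."""
    return sorted(database, key=lambda name: timbre_distance(query, database[name]))

In a QBE scenario the descriptors extracted from the sample sound form the query vector, and the items of the music database are ranked by increasing distance.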
Figure 5.3 The PercussiveInstrumentTimbre (from Manjunath et al., 2002)
5.1.4 PercussiveInstrumentTimbre
The PercussiveInstrumentTimbre depicted in Figure 5.3 can describe impulsive sounds without any harmonic portions. To this end it includes:
• LogAttackTime (LAT), the LogAttackTime descriptor, see Section 2.7.2.
• SpectralCentroid (SC), the SpectralCentroid descriptor, see Section 2.7.9.
• TemporalCentroid (TC), the TemporalCentroid descriptor, see Section 2.7.3.
5.2 MELODY

The MPEG-7 Melody DS provides a rich representation for monophonic melodic information to facilitate efficient, robust and expressive melodic similarity matching.

The term melody denotes a series of notes or a succession, not a simultaneity as in a chord, see (Wikipedia, 2001). However, this succession must contain change of some kind and be perceived as a single entity (possibly gestalt) to be called a melody. More specifically, this includes patterns of changing pitches and durations, while more generally it includes any interacting patterns of changing events or quality.
What is called a “melody” depends greatly on the musical genre. Rock music and folk songs tend to concentrate on one or two melodies, verse and chorus. Much variety may occur in phrasing and lyrics. In western classical music, composers often introduce an initial melody, or theme, and then create variations. Classical music often has several melodic layers, called polyphony, such as those in a fugue, a type of counterpoint. Often melodies are constructed from motifs or short melodic fragments, such as the opening of Beethoven’s Ninth Symphony. Richard Wagner popularized the concept of a leitmotif: a motif or melody associated with a certain idea, person or place.
For jazz music a melody is often understood as a sketch and widely changed by the musicians. It is more understood as a starting point for improvization. Indian classical music relies heavily on melody and rhythm, and not so much on harmony as the above forms. A special problem arises for styles like Hip Hop and Techno. This music often presents no clear melody and is more related to rhythmic issues. Moreover, rhythm alone is enough to picture a piece of music, e.g. a distinct percussion riff, as mentioned in (Manjunath et al., 2002). Jobim’s famous “One Note Samba” is a nice example where the melody switches between pure rhythmical and melodic features.
5.2.1 Melody
The structure of the MPEG-7 Melody is depicted in Figure 5.4. It contains information about meter, scale and key of the melody. The representation of the melody itself resides inside either the field MelodyContour or MelodySequence.

Figure 5.4 The MPEG-7 Melody (from Manjunath et al., 2002)
Besides the optional field Header there are the following entries:
• Meter: the time signature is held in the Meter (optional).
• Scale: in this array the intervals representing the scale steps are held (optional).
• Key: a container containing degree, alteration and mode (optional).
• MelodyContour: a structure of MelodyContour (choice).
• MelodySequence: a structure of MelodySequence (choice).
All these fields and necessary MPEG-7 types will be described in more detail in the following sections.
5.2.2 Meter
The field Meter contains the time signature. It specifies how many beats are in each bar and which note value constitutes one beat. This is done using a fraction: the numerator holds the number of beats in a bar, the denominator contains the length of one beat. For example, for the time signature 3/4 each bar contains three quarter notes. The most common time signatures in western music are 4/4, 3/4 and 2/4.

The time signature also gives information about the rhythmic subdivision of each bar, e.g. a 4/4 meter is stressed on the first and third beat of each bar by convention. For unusual rhythmical patterns in music, complex signatures like 3 + 2 + 3/8 are given. Note that this cannot be represented exactly by MPEG-7 (see the example below).
Figure 5.5 The MPEG-7 Meter (from Manjunath et al., 2002)
The Meter is shown in Figure 5.5. It is defined by:
• Numerator: contains values from 1 to 128.
• Denominator: contains powers of 2, from 2^0 to 2^7, i.e. 1, 2, 4, …, 128.

In MPEG-7, complex signatures like 3 + 2 + 3/8 have to be defined in a simplified manner like 8/8, as in the sketch below.
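As a rough illustration (a sketch with assumed field names, not the MPEG-7 DDL definition), a Meter value can be modelled as a numerator/denominator pair with exactly these constraints:

from dataclasses import dataclass

@dataclass
class Meter:
    """Time signature as held by the MPEG-7 Meter."""
    numerator: int    # 1..128, number of beats per bar
    denominator: int  # power of two between 2^0 and 2^7, i.e. 1..128

    def __post_init__(self):
        if not 1 <= self.numerator <= 128:
            raise ValueError("numerator must be in 1..128")
        if self.denominator not in (1, 2, 4, 8, 16, 32, 64, 128):
            raise ValueError("denominator must be a power of two up to 128")

waltz = Meter(3, 4)          # a 3/4 time signature
approximated = Meter(8, 8)   # 3 + 2 + 3/8 can only be stored simplified as 8/8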
5.2.3 Scale

The Scale descriptor contains a list of intervals representing the scale steps that divide the octave. The intervals result in a list of frequencies giving the pitches of the single notes of the scale. In traditional western music, scales consist of seven notes, made up of a root note and six other scale degrees whose pitches lie between the root and its first octave. Notes in the scale are separated by whole and half step intervals of tones and semitones, see (Wikipedia, 2001). There are a number of different types of scales commonly used in western music, including major, minor, chromatic, modal, whole tone and pentatonic scales. There are also synthetic scales like the diminished scales (also known as octatonic), the altered scale, the Spanish and Jewish scales, or the Arabic scale.
The relative pitches of individual notes in a scale may be determined by one of a number of tuning systems. Nowadays, in most western music, the equal temperament is the most common tuning system. Starting with a pitch at F0, the pitch of note n can be calculated using:

fn = F0 · 2^(n/12)     (5.3)

Figure 5.6 The MPEG-7 Scale. It is a simple vector of float values (from Manjunath et al., 2002)
Also mentioned in the MPEG-7 standard is the Bohlen–Pierce (BP) scale, a non-traditional scale containing 13 notes. It was independently developed in 1972 by Heinz Bohlen, a microwave electronics and communications engineer, and later by John Robinson Pierce, also a microwave electronics and communications engineer! See the examples for more details.
The information of the Scale descriptor may be helpful for reference purposes. The structure of the Scale is a simple vector of floats as shown in Figure 5.6:
• Scale: the vector contains the parameter n of Equation (5.3). Using the whole numbers 1–12 results in the equal-tempered chromatic scale, which is also the default of the Scale vector. If the frequencies fn of the pitches building a scale are given, the values scale(n) of the Scale vector can be calculated using:

scale(n) = 12 · log2(fn / F0)
The default scale, the equal-tempered chromatic scale, is thus simply represented as the vector 1, 2, 3, …, 12.
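To make the two formulas concrete, here is a minimal sketch (plain Python with an assumed reference pitch; any root note would do) that converts between scale step values and frequencies under equal temperament:

import math

F0 = 261.63   # assumed reference pitch in Hz (middle C)

def pitch_of_step(n: float, f0: float = F0) -> float:
    """Equation (5.3): frequency of scale step n above the reference pitch."""
    return f0 * 2.0 ** (n / 12.0)

def scale_value(fn: float, f0: float = F0) -> float:
    """Inverse mapping: the Scale vector entry for a note of frequency fn."""
    return 12.0 * math.log2(fn / f0)

# The default (equal-tempered chromatic) scale is just the whole numbers 1..12.
chromatic = [float(n) for n in range(1, 13)]
freqs = [pitch_of_step(n) for n in chromatic]
recovered = [round(scale_value(f), 6) for f in freqs]
assert recovered == chromatic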