DUBLIN CITY UNIVERSITY
SCHOOL OF ELECTRONIC ENGINEERING
Detection of Interesting Events in Movies using
only the Audio signal
PHAM MINH LUAN NGUYEN
Acknowledgements
I would like to thank my supervisor Dr Sean Marlow for his extensive guidance, enthusiasm and commitment to this project. Thanks are also due to Dr David Sadlier for providing movies and code. Thanks also to all other friends and colleagues for their contributions.
Declaration
I hereby declare that, except where otherwise indicated, this document is entirely my own work and has not been submitted in whole or in part to any other university.
Signed: Date:
Abstract

The rapid expansion of the movie industry is driving the need for efficient digital video indexing, browsing and playback systems. This report develops an automatic detector system that finds exciting events directly in the original movie using only the audio signal. Interesting events in movies are typically flagged by high audio amplitude, so detecting these events from the audio amplitude is an efficient approach. It is also a fast detection method, taking advantage of the fact that audio features are computationally cheaper than visual features. The detected highlight events are then classified to evaluate the automatic system.
Contents

Acknowledgements
Declaration
Abstract
Contents
List of Figures
List of Graphs
List of Tables
Chapter 1 – Introduction
1.1 Related work
1.1.1 Automatically Selecting Shots for Action Movie Trailers
1.1.2 Voice Processing for Automatic TV Sports Program Highlights Detection
1.1.3 Audio/visual analysis for high-speed TV advertisement detection from MPEG bitstream
1.2 Exciting event detection in movies using the audio signal
Chapter 2 – MPEG-1 Audio/Video Standard
2.1 Overview
2.2 MPEG-1 Layer 2 Audio
Chapter 3 – Movie Highlight Detection
3.1 Getting Ground Truth
3.2 Automatic Detection
3.2.1 Getting Scale Factor
3.2.2 Audio amplitude threshold
Chapter 4 – Results and Analysis
4.1 Results
4.1.1 The average audio amplitude
4.1.2 The audio amplitude threshold time
4.1.3 Results and result tables
4.2 Precision and Recall
Chapter 5 – Conclusions and Further Work
5.1 System Evaluation
5.2 Further Work
References
List of Figures

Figure 2-1: ISO/MPEG-1 Layer I/II encoder
Figure 2-2: Structure of Layer-II subband samples
Figure 2-3: The data bitstream structure of Layer II
Figure 3-1: MPEG-1 Layer-II frequency subbands
Figure 3-2: Video frame audio levels generated from scalefactors corresponding to temporally associated audio
List of Graphs

Graph 3-1: Per-frame audio amplitude level for example movie
Graph 3-2: Per-second audio amplitude level for example movie
Graph 3-3: Audio amplitude profile of Night at the Museum 2
Graph 3-4: Audio amplitude detection of Night at the Museum 2
Graph 3-5: Audio amplitude detection of Night at the Museum 2 and ground truth (blue is automatic detection, red is the ground truth)
Graph 3-6: Audio amplitude profile of The Kingdom
Graph 3-7: Audio amplitude detection of The Kingdom
Graph 3-8: Audio amplitude detection of The Kingdom and ground truth
Graph 3-9: Audio amplitude profile of The Legend of Butch and Sundance
Graph 3-10: Audio amplitude detection of The Legend of Butch and Sundance
Graph 3-11: Comparison of automatic detection and ground truth
Graph 3-12: Audio amplitude profile (Night at the Museum 2 – one frame)
Graph 3-13: Automatic detection and ground truth (Night at the Museum 2 – one frame)
Graph 3-14: Audio amplitude profile (Night at the Museum 2 – two frames)
Graph 3-15: Automatic detection and ground truth (Night at the Museum 2 – two frames)
Graph 3-16: Audio amplitude profile (Night at the Museum 2 – two seconds)
Graph 3-17: Automatic detection and ground truth (Night at the Museum 2 – two seconds)
Graph 3-18: Audio amplitude profile (Night at the Museum 2 – four seconds)
Graph 3-19: Automatic detection and ground truth (Night at the Museum 2 – four seconds)
Graph 3-20: Audio amplitude profile (The Kingdom – one frame)
Graph 3-21: Automatic detection and ground truth (The Kingdom – one frame)
Graph 3-22: Audio amplitude profile (The Kingdom – two frames)
Graph 3-23: Automatic detection and ground truth (The Kingdom – two frames)
Graph 3-24: Audio amplitude profile (The Kingdom – two seconds)
Graph 3-25: Automatic detection and ground truth (The Kingdom – two seconds)
Graph 3-26: Audio amplitude profile (The Kingdom – four seconds)
Graph 3-27: Automatic detection and ground truth (The Kingdom – four seconds)
Graph 3-28: Audio amplitude profile (The Legend of Butch and Sundance – one frame)
Graph 3-29: Automatic detection and ground truth (The Legend of Butch and Sundance – one frame)
Graph 3-30: Audio amplitude profile (The Legend of Butch and Sundance – two frames)
Graph 3-31: Automatic detection and ground truth (The Legend of Butch and Sundance – two frames)
Graph 3-32: Audio amplitude profile (The Legend of Butch and Sundance – two seconds)
Graph 3-33: Automatic detection and ground truth (The Legend of Butch and Sundance – two seconds)
Graph 3-34: Audio amplitude profile (The Legend of Butch and Sundance – four seconds)
Graph 3-35: Automatic detection and ground truth (The Legend of Butch and Sundance – four seconds)
List of Tables

Table 3-1: Ground truth of Night at the Museum 2
Table 3-2: Ground truth of The Kingdom
Table 3-3: Ground truth of The Kingdom (continued)
Table 3-4: Ground truth of The Legend of Butch and Sundance
Table 3-5: Ground truth of The Legend of Butch and Sundance (continued)
Table 4-1: Comparison of results between the automatic system and the ground truth
Table 4-2: Possible exciting events detected by the automatic system
Table 4-3: Ground truth events missed by the automatic system
Table 4-4: Comparison of results between the automatic system and the ground truth
Table 4-5: Possible exciting events detected by the automatic system
Table 4-6: Comparison of results between the automatic system and the ground truth
Table 4-7: Possible exciting events detected by the automatic system
Table 4-8: Ground truth events missed by the automatic system
Table 4-9: Precision and recall values for three movies
Chapter 1 – Introduction
The growing availability of video content creates a strong requirement for efficient tools to manage and access multimedia data [3]. Considerable progress has been made in audio analysis for movie content, with automatic highlight detection being one of the targets of recent research. Highlight detection is important since it provides the user with a short version of the movie that ideally contains all the information needed to understand the content. Hence, the user may quickly evaluate whether the movie is interesting or not.
Audio, which includes voice, music, and various kinds of environmental sounds, is an important type of media and a significant part of audiovisual data. As more and more digital audio databases come into use, people are realizing the importance of effective audio database management that relies on audio content analysis.
Audio segmentation and classification have applications in professional media production, audio archive management, commercial music usage, surveillance, and so on. Furthermore, audio content analysis may play a primary role in video annotation. Current approaches for video segmentation and indexing are mostly focused on visual information. However, visual-based processing often leads to a far too fine segmentation of the audiovisual sequence; a combined analysis of the diverse multimedia components (audio, visual, and textual information) will be essential in achieving a fully functional system for video parsing.
Existing research on content-based audio data management is very limited. There are in general four directions [6]. One direction is audio segmentation and classification, where one basic problem is speech/music discrimination. The second direction is audio retrieval; one specific technique in content-based audio retrieval is query-by-humming. The third direction is audio analysis for video indexing. The fourth direction is the integration of audio and visual information for video segmentation and indexing.
1.1 Related work
1.1.1 Automatically Selecting Shots for Action Movie Trailers
Alan F Smeaton, Bart Lehane, Noel E O'Connor, Conor Brady and Gary Craig of Dublin City University, Ireland have researched the area of movie highlights [3]. Their study was based on the following principles:
• They utilise a shot boundary detection technique in order to generate the basic shot-based structure of a movie. Colour histograms have been demonstrated to be a highly accurate and efficient method of comparing images and detecting shot boundaries (a minimal sketch follows this list).
• The audio track of a movie is analysed in order to detect the presence of the following categories: speech, music, silence, speech with background music, and other audio. Their rationale for using these audio categories is that music can be indicative of high, or low, points of a movie.
• For each shot they also detect two motion features: the motion intensity and the percentage of camera movement present. The motion intensity is an indicator of the amount of motion within each frame of video, and is determined by calculating the standard deviation of the motion vectors.
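The paper does not give implementation details for the histogram comparison, so the following is only a minimal sketch of one common variant: an L1 distance between per-channel colour histograms of consecutive frames, where the bin count and the boundary threshold are hypothetical values.

```python
import numpy as np

def colour_histogram(frame, bins=16):
    """Concatenated per-channel histogram of an RGB frame, normalised to sum to 1."""
    hists = [np.histogram(frame[..., c], bins=bins, range=(0, 256))[0]
             for c in range(3)]
    h = np.concatenate(hists).astype(float)
    return h / h.sum()

def shot_boundaries(frames, threshold=0.4):
    """Flag frame i as a shot boundary when the L1 distance between the
    histograms of frames i-1 and i exceeds the (hypothetical) threshold."""
    boundaries = []
    prev = colour_histogram(frames[0])
    for i in range(1, len(frames)):
        cur = colour_histogram(frames[i])
        if np.abs(cur - prev).sum() > threshold:
            boundaries.append(i)
        prev = cur
    return boundaries
```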
The features used to detect shots for trailers are shot length, motion intensity, the amount of camera movement, and the amounts of speech, music, silence, speech with background music and other audio present in each shot. Evaluation of the performance of their shot selection used the classic measures of precision and recall, where a set of shots selected using their trained approach was compared against the ground truth of shots which appear in the official movie trailer. Their approach uses SVMs (support vector machines) to select shots in rank order based on their likelihood of inclusion in the original trailer, and the specific metric they use for evaluation is R-Precision [14]. Given a ranked list produced as the output of a system to be evaluated, R-Precision is defined as the precision at rank position R, where R is the number of documents or objects relevant to the query.
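As a concrete illustration of the metric (not the authors' code), R-Precision can be computed from a ranked shot list as follows:

```python
def r_precision(ranked_shots, relevant_shots):
    """Precision at rank R, where R is the number of relevant shots
    (here, the shots that appear in the official trailer)."""
    relevant = set(relevant_shots)
    r = len(relevant)
    if r == 0:
        return 0.0
    hits = sum(1 for shot in ranked_shots[:r] if shot in relevant)
    return hits / r
```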
When evaluating shot selection they face the issue of how to evaluate sub-shot retrieval. One approach they could take is to evaluate based on the proportion of frames from the original movie which appear in the trailer; this would correspond to the way gradual shot transitions are evaluated in TRECVid [13] using frame-precision and frame-recall, where the evaluation is in terms of the number of overlapping frames.
Evaluation of their approach to trailer shot selection was done using leave-one-out k-fold cross-validation. This is a technique used in information retrieval in which a dataset T is divided into training (T1) and testing (T2) subsets, T = T1 + T2; training is done on T1 and testing on T2; then T is re-divided into different training and testing subsets T1′ and T2′, and the training and evaluation are repeated, a total of k times.
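A minimal sketch of the k-fold splitting itself; the training step is left abstract, since the authors' SVM setup is not reproduced here, and the `train_svm`/`evaluate` names in the usage comment are hypothetical.

```python
def k_fold_splits(items, k):
    """Yield (train, test) pairs; each item lands in exactly one test fold
    across the k rounds."""
    folds = [items[i::k] for i in range(k)]
    for held_out in range(k):
        test = folds[held_out]
        train = [x for j, fold in enumerate(folds) if j != held_out
                 for x in fold]
        yield train, test

# Hypothetical usage: average a score over the k rounds.
# scores = [evaluate(train_svm(train), test)
#           for train, test in k_fold_splits(movies, k=5)]
```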
The results show several interesting aspects. Firstly, the consistently high results indicate that this approach to selecting shots for action movie trailers is both accurate and reliable. One possible danger with their results is that the accuracy could be biased by the use of automatic shot segmentation. A correct classification of a movie trailer shot occurs when the ground-truth trailer sub-shot occurs within the selected movie full-shot.
Three event classes were chosen (exciting, dialogue and musical) that typically encapsulate all relevant portions of a movie. A range of low-level audiovisual features were extracted, and finite state machines were used to detect the events.
1.1.2 Voice Processing for Automatic TV Sports Program Highlights Detection
This study was done by Seán Marlow, David A Sadlier, Noel O'Connor and Noel Murphy of Dublin City University, Ireland [4]. It uses sports programmes provided by the Centre for Digital Video Processing at DCU, and it focuses on the audio for highlight detection. The authors used features of MPEG-1 Layer II audio together with a characteristic of the audio in sports programmes: the audio amplitude becomes high when an exciting event happens, e.g. a goal in a football match, a penalty offence, or a red-card offence. The authors focus on the audio amplitude for highlight detection through the scale factors in MPEG-1 Layer II audio. The principle is an audio amplitude threshold: the scale factors are stripped from the audio and processed to obtain an amplitude level for each frame, and a highlight is flagged when three consecutive audio-amplitude frames exceed the amplitude threshold, which makes for a fast method. The reported results detected almost all of the highlight events in the sports programmes; the method was successful in locating both the presence of highlight events and the boundaries of the events.
Their work is a preliminary investigation into the usefulness of pure audio analysis for summarisation of (limited types of) sports programmes. A further eight 10-minute summaries were generated from various other broadcast sports programmes. The content of the returned clips makes up the final summary.
In a real scenario, automatic summarisation of such broadcasts would depend on some combination of an analysis of the closed captions (teletext) and analysis at the visual level.
1.1.3 Audio/visual analysis for high-speed TV advertisement detection from MPEG bitstream
This project is research by David A Sadlier, Noel O'Connor, Sean Marlow and Noel Murphy [5]. The research concerns TV advertisements. A television programme is typically accompanied by beginning and end credits, with one or more ad-breaks somewhere in the middle. To the user, these features of a programme would generally be regarded as an insignificant part of the material. Their study was based on the following principles:
• Black video frame detection: a black video frame may be recognised by its luminance histogram, which would typically be characterised by having most of its 'power' at the bottom end of the pixel amplitude spectrum, corresponding to black or very dark pixels.
• Silent video frame detection: a summation of the absolute values of all the individual audio samples corresponding to the temporal length of one video frame may be defined as the 'audio level' for that frame, i.e. for a video frame with relatively quiet audio, a low audio level would be expected. Thus, by thresholding this audio level, silent video frames (of intensity defined by the threshold) may be detected (a minimal sketch of both checks follows this list).
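A minimal sketch of both checks, assuming a luminance plane and the frame's audio samples are already available as arrays; all threshold values here are hypothetical, not the authors' tuned parameters.

```python
import numpy as np

def is_black_frame(luma, dark_level=32, dark_fraction=0.95):
    """Black frame: most of the luminance histogram's 'power' sits at the
    bottom end of the pixel amplitude range."""
    return np.mean(luma < dark_level) > dark_fraction

def is_silent_frame(audio_samples, silence_threshold=1000.0):
    """Silent frame: the summed absolute audio sample values over one video
    frame's duration (the 'audio level') fall below a threshold."""
    return np.abs(audio_samples).sum() < silence_threshold
```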
The authors report that a series of black/silent video frames may indicate the existence of an ad-break. However, they also use another element, namely some features of the advertisement breaks themselves: the length of the advertisement breaks and the number of frames between two advertisement breaks.
1.2 Exciting event detection in movies using the audio signal
We have several cases to study regarding event detection. In the first case, events in movies were detected using audiovisual data [3]. In the second case, the audio signal was used to highlight events in sports programmes [4]. In the third case, audiovisual data were used to detect ad-breaks in a television programme [5]. However, none of them detects events in movies using only the audio signal.

A method that uses the audio signal to highlight events in movies is the cheaper way: it does not take as much computation time as an audiovisual method. In this document, we choose one feature of the audio signal to highlight events in movies: the audio amplitude. The audio amplitude is one indicator of exciting events, since exciting events in movies usually occur with high audio amplitude. The high-audio-amplitude events may be gunshot events, fighting events, crash events, or explosion events. So the audio amplitude may be helpful for highlighting the events.
Chapter 2 – MPEG-1 Audio/Video Standard
2.1 Overview
The Moving Pictures Experts Group (MPEG) [15], which meets under the International Standards Organisation (ISO), generates international standards for digital video and audio compression. MPEG-1 is a standard in five parts:
1. ISO/IEC 11172-1:1993 (Systems)
This addresses the problem of combining one or more data streams from the video and audio parts of the MPEG-1 standard with timing information to form a single stream, i.e. multiplexing and synchronisation of audio/video.
2. ISO/IEC 11172-2:1993 (Video)
This specifies the coded representation of video for digital storage media.
3. ISO/IEC 11172-3:1993 (Audio)
This specifies the coded representation of audio for digital storage media.
4. ISO/IEC 11172-4:1995 (Conformance testing)
This specifies how tests can be designed to verify whether bitstreams and decoders meet the requirements of the first three parts.
5. ISO/IEC TR 11172-5
Technically not a standard, but a technical report. It gives a full software implementation of the first three parts of the MPEG-1 standard.
2.2 MPEG-1 Layer 2 Audio
The MPEG-1 audio standard (ISO/IEC 11172-3) comprises a flexible hybrid coding technique that incorporates several methods, including subband decomposition, filter-bank analysis, transform coding, entropy coding, dynamic bit allocation, nonuniform quantization, adaptive segmentation, and psychoacoustic analysis. The MPEG-1 audio codec operates on 16-bit PCM input data at sample rates of 32, 44.1 and 48 kHz. Moreover, MPEG-1 offers separate modes for mono, stereo, dual independent mono and joint stereo. Available bit rates are 32–192 kb/s for mono and 64–384 kb/s for stereo.
The MPEG-1 architecture contains three layers of increasing complexity, delay and output quality. Each higher layer incorporates functional blocks from the lower layers. The input signal is first decomposed into 32 critically subsampled subbands using a polyphase realization of a pseudo-QMF (PQMF) bank. The channels are equally spaced, such that a 48-kHz input signal is split into 750-Hz subbands, with the subbands decimated 32:1. A 511th-order prototype filter was chosen such that the inherent overall PQMF distortion remains below the threshold of audibility. Moreover, the prototype filter was designed for high sidelobe attenuation (96 dB) to ensure that intraband aliasing remains negligible.
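As a quick sanity check on the 750-Hz figure, the uniform split of the Nyquist band can be computed directly:

```python
# Width of each subband in a 32-channel critically sampled filter bank.
sample_rate = 48_000                           # Hz
n_subbands = 32
subband_width = sample_rate / 2 / n_subbands   # Nyquist band split evenly
assert subband_width == 750.0                  # Hz, matching the text
```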
Figure 2-1: ISO/MPEG-1 Layer I/II encoder [2]
[Block diagram: 32-channel PQMF analysis bank; FFT computation (L1: 512, L2: 1024); block companding quantization; dynamic bit allocation]
For the purposes of psychoacoustic analysis and determination of just-noticeable-distortion (JND) thresholds, a 512-point (Layer I) or 1024-point (Layer II) FFT is computed in parallel with the subband decomposition for each decimated block of 12 input samples (8 ms at 48 kHz). Next, the subbands are block companded (normalized by a scale factor) such that the maximum sample amplitude in each block is unity; then an iterative bit allocation procedure applies the JND threshold to select an optimal quantizer from a predetermined set for each subband. Quantizers are selected such that both the masking and bit rate requirements are simultaneously satisfied. In each subband, scale factors are quantized using 6 bits and quantizer selections are encoded using 4 bits.
MPEG-1 Audio specifies three layers. The different layers offer increasingly higher audio quality at slightly increased complexity. While Layers I and II share the basic structure of the encoding process, having their roots in an earlier algorithm known as MUSICAM, Layer III is substantially different.
Layer I is the simplest layer; it operates at data rates between 32 and 224 kb/s per channel, with the preferred range of operation above 128 kb/s. Layer I finds application, for example, in the digital compact cassette (DCC) at 192 kb/s per channel. Layer II is of medium complexity and employs data rates between 32 and 192 kb/s per channel. At 128 kb/s per channel it provides very good audio quality.
The MPEG-1 Layer-II compression algorithm encodes audio signals as follows: the frequency spectrum of the audio signal, bandlimited to 20 kHz, is uniformly divided into 32 subbands. The subbands are assigned individual bit allocations according to the audibility of quantisation noise within each subband. A psychoacoustic model of the ear analyses the audio signal and provides this information to the quantiser.
Layer-II frames consist of 1152 samples: 3 groups of 12 samples from each of the 32 subbands. A group of 12 samples gets a bit allocation and, if this is non-zero, a scalefactor. Scalefactors are weights that scale groups of 12 samples such that they fully use the range of the quantiser. The scalefactor for such a group is determined by the next largest value (given in a look-up table) to the maximum of the absolute values of the 12 samples. Thus it provides an indication of the maximum power exhibited by any one of the 12 samples within the group.
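The scalefactor table itself is not reproduced in this report; the sketch below assumes the 63-entry table of ISO/IEC 11172-3, whose values descend from 2.0 in steps of a factor of 2^(-1/3), and picks the 'next largest' entry as described above.

```python
# Assumed reconstruction of the Layer-II scalefactor table: 63 values
# spaced by a factor of 2**(-1/3), descending from 2.0.
SCALEFACTORS = [2.0 * 2.0 ** (-i / 3.0) for i in range(63)]

def scalefactor_for_granule(samples):
    """Return the smallest table value >= max |sample| of the 12-sample
    granule, i.e. the 'next largest' value in the look-up table."""
    peak = max(abs(s) for s in samples)
    for sf in reversed(SCALEFACTORS):   # ascend from the smallest entry
        if sf >= peak:
            return sf
    return SCALEFACTORS[0]              # clamp to the largest table value
```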
Figure 2-2: Structure of Layer-II subband samples [5]
[Figure labels: scale factor (6 bits); samples (2–16 bits)]
Chapter 3 – Movie Highlight Detection
This study focuses on audio, especially the audio amplitude. A movie contains many kinds of events, e.g. speech, music, speech with background music, and screams. Usually, the audio amplitude does not change much if the event is just speech. An exciting event in a movie may be a gunshot, an explosion, a laugh, or a scream. When an exciting event happens, the audio amplitude increases suddenly, e.g. with a gunshot or a loud voice.
3.1 Getting Ground Truth
When we get the results from the automatic detection method, how do we know how well it performs? We need a table of the exciting events, and to get this table we have to work by hand. We call this the ground truth. To know exactly where the events happen in a movie, we need to watch the movie and note the exciting events: when each exciting event happens and how long it lasts. We write all the event information in a table: the event time and the event length. In this step we have a problem of subjectivity, because an event that is exciting to us may not be exciting to someone else. We need a way around this, and the movie trailer can help us judge what counts as exciting while building the ground truth. The trailer was assembled manually to advertise the movie, so exciting events tend to appear in it, although not all of the exciting events are in the trailer. We simply refer to the movie trailer to gauge how good the automatic method is.
Another problem when building the ground truth is the length of the events. For example, an event may combine a gunshot with fighting and beating, so we need either to choose the main event or to combine all of these into one big event. In some cases the big event lasts a long time, so the automatic detection can return as many results as we want.
Table 3-1: Ground truth of Night at the Museum 2

No. | Time (hh.mm.ss, seconds in parentheses) | Event description | Length (s)
2   | 00.08.31 – 00.09.10 (531 – 550)   | Loud noise, scream, dump | 19
5   | 00.26.20 – 00.27.41 (1580 – 1661) | Buster, scream, drum-beat | 81
6   | 00.30.00 – 00.32.56 (1800 – 1976) | Drum-beat, buster, cracker, wham, fighting, sound of spear flying | 176
13  | 01.07.07 – 01.07.40 (4027 – 4060) | Whirr, scream, music | 33
14  | 01.07.50 – 01.08.40 (4070 – 4120) | Alarm, scream, shouting | 50
15  | 01.14.20 – 01.16.47 (4460 – 4607) | Scream, drum-beat, crunch, clump, crash, footstep, loud noise | 147
16  | 01.17.36 – 01.17.57 (4656 – 4677) | Trumpet-call, battle-cry | 21
18  | 01.21.11 – 01.21.40 (4871 – 4900) | Drum beating, fighting | 29
19  | 01.21.47 – 01.22.39 (4907 – 4959) | Shouting, drum beating, … | 52
Table 3-2: Ground truth of The Kingdom

No. | Time (hh.mm.ss, seconds in parentheses) | Event description | Length (s)
1   | 00.00.39 – 00.00.56 (39 – 56)     | Music and name of music | 17
2   | 00.00.58 – 00.03.51 (58 – 231)    | Speech, drum beating | 173
4   | 00.07.30 – 00.08.00 (450 – 480)   | Gunshot, machine-gun shot | 30
5   | 00.08.10 – 00.09.04 (490 – 544)   | Gunshot, machine-gun shot | 54
6   | 00.09.28 – 00.10.33 (578 – 633)   | Loud voice, ambulance, … | 55
19  | 01.25.12 – 01.25.23 (5112 – 5123) | Explosion, crashing, smash | 11
20  | 01.25.30 – 01.25.44 (5130 – 5144) | Explosion, gunshot, shouting | 14
3.2 Automatic Detection
The audio amplitude is a useful feature for detecting exciting events. In a movie, an exciting event may last only milliseconds, while other events may last seconds or minutes. On the other hand, some events have high audio amplitude but are not exciting events, and some exciting events do not have audio amplitude high enough for us to detect. So, in practice, this automatic method may miss some events. Automatic detection means obtaining the exciting events automatically through a system: the system examines only the audio signal of the movie, suggests the exciting events, and includes the event lengths.
To use the audio amplitude in our detection, we need to obtain only the audio from the movie. First, the audio is extracted separately from the movie and saved as a *.aud file. The next step is to strip the scale factors from the audio file. The scale factors give a clear view of the audio amplitude because they carry information about it. Once we have the scale factors of the audio file, we need to find the audio amplitude in one movie frame; the per-frame audio amplitude is a good indicator of the exciting events in a movie. Before doing this work, we need to choose the type of movie and the type of movie file. In some cases the movie type gives better detection results; for example, an action movie is a good type to work with because its exciting events usually have higher audio amplitude than the other events. To get the scale factors, we also need to know the file type, because the extraction depends on the compression method. In this project, the movie type is MPEG-1 and the audio type is MPEG-1 Layer II; since we need to study the audio amplitude, MPEG-1 Layer II is suitable for this sample study.
Once we have the audio level per frame, we begin to analyse the audio amplitude of the movie. This study is based on the audio amplitude, so we focus on the threshold of the audio amplitude and the threshold time of the audio amplitude. In each case, we changed the values to compare and to find the better way to detect exciting events in movies.
3.2.1 Getting Scale Factor
3.2.1.1 Reduction of cut-off frequency [4], [5]
One scale factor is computed for each group of 12 subband frequency samples (called a "granule"). The maximum absolute value of the 12-sample granule is determined and mapped to a scale factor value via a lookup table defined in the standard. The samples in the granule are divided by the scale factor prior to the quantization stage. The dynamic range covered by the scale factors is 120 dB.
Most of the energy in a speech signal lies between 0.1 kHz and 4 kHz. According to the MPEG-1 Layer-II audio standard, the maximum allowable frequency component in the audio signal is 20 kHz. At the encoder, the frequency spectrum (0–20 kHz) is divided uniformly into 32 subbands, each having a bandwidth of 0.625 kHz. Thus, subbands 2 through 7 represent the frequency range from 0.625 kHz to 4.375 kHz.
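The subband-to-frequency mapping can be written down directly (1-indexed subbands, as above):

```python
def subband_range_khz(n, bandwidth=0.625):
    """Frequency span (kHz) of 1-indexed subband n under the uniform split."""
    return ((n - 1) * bandwidth, n * bandwidth)

assert subband_range_khz(2) == (0.625, 1.25)   # low edge of the speech band
assert subband_range_khz(7) == (3.75, 4.375)   # high edge of the speech band
```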
By limiting the audio examination to these subbands, which approximate the range of the speech band, we further concentrate the audio investigation on vocal content (in [4], the commentator's voice). Therefore, the influence of the voice on the generation of the audio amplitude profile is increased. It was expected that the examination of subbands 2 through 7 would provide a reasonable trade-off between rejection of low-frequency background noise and the capture of the vocal content.
The subband scale factor of a group of 12 samples effectively indicates the maximum power exhibited by any one sample within the group. Thus it provides a means by which a variable audio power level may be tracked on a per-12-sample basis without necessitating a decode of the compressed bitstream.
The proposal was that an audio volume level for each video frame could be determined by a superposition of the scale factors corresponding to the groups of audio samples with which the video frame is temporally associated. This volume level was expected to remain significantly low for silent video frames and to be high for video frames associated with more substantial audio content. Thus, by thresholding this value, the video frames may be assigned a silent/non-silent categorisation as desired.
Graph 3-1: Per-frame audio amplitude level for example movie
A frame-by-frame audio amplitude profile was established by a superposition of all the scalefactors from subbands 2–7 over a window of length corresponding to one video frame (~1/25 s).
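The bitstream-parsing code is not reproduced here; given scalefactors already stripped from the file, the per-frame profile could be accumulated as in this sketch, where the (time, subband, value) record format is an assumption about the intermediate data, not the report's actual file layout.

```python
def per_frame_audio_level(scalefactor_records, fps=25.0):
    """Sum scalefactors from subbands 2-7 into bins of one video frame.

    scalefactor_records: iterable of (time_seconds, subband, value) tuples,
    assumed to have been stripped from the Layer-II bitstream beforehand.
    """
    levels = {}
    for t, subband, value in scalefactor_records:
        if 2 <= subband <= 7:          # speech band, ~0.625-4.375 kHz
            frame = int(t * fps)       # video frame this granule overlaps
            levels[frame] = levels.get(frame, 0.0) + value
    n_frames = max(levels) + 1 if levels else 0
    return [levels.get(i, 0.0) for i in range(n_frames)]
```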
Graph 3-2: Per-second audio amplitude level for example movie
Figure 3-2: Video frame audio levels generated from scalefactors corresponding to temporally associated audio [5]
3.2.2 Audio amplitude threshold
We want to find the best audio amplitude threshold for detecting the exciting events in movies. In the previous section, we stripped the scale factors from the audio file and then computed the audio amplitude in one frame, which we call the audio level or audio amplitude. However, that is just the audio amplitude in one frame. We may choose two directions for event detection based on the per-frame audio amplitude: the first is to use the audio amplitude in one second to detect events in the movie; the second is to use the audio amplitude over various time ranges.
3.2.2.1 Audio amplitude detection in one second
We need to calculate the audio amplitude threshold to detect events in the movie. There are twenty-five frames per second in the movie. There are several stages in finding the audio amplitude threshold:
• The audio amplitude value in one second is calculated by averaging the audio amplitude values over a series of twenty-five frames.
• A medium audio amplitude value is calculated by averaging all the one-second audio amplitude values over the entire movie.
• The audio amplitude threshold is expressed as the product of the medium audio amplitude value and an optimum value. Many optimum values were tried; values in the range [1.8, 2.2] give better results than the others.
Once we have the audio amplitude threshold, we begin to detect the exciting events in the movie. An exciting event may last one second or more. Exciting events are picked out when the audio amplitude values of at least two consecutive seconds are larger than the audio amplitude threshold (a minimal sketch of this detector follows).
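Putting the stages together, this is a minimal sketch of the per-second detector; the optimum multiplier and the two-consecutive-second rule follow the text above, while the function plumbing is an assumption.

```python
def detect_exciting_events(frame_levels, fps=25, optimum=2.0, min_run=2):
    """Per-second threshold detector over a per-frame amplitude profile.

    frame_levels: per-video-frame audio amplitude values (subbands 2-7).
    optimum: multiplier on the movie-wide mean; the report found the
    range [1.8, 2.2] to work best.
    min_run: an event needs at least two consecutive above-threshold seconds.
    """
    # Per-second amplitude: average over each block of 25 frames.
    seconds = [sum(frame_levels[i:i + fps]) / fps
               for i in range(0, len(frame_levels) - fps + 1, fps)]
    if not seconds:
        return []
    threshold = optimum * (sum(seconds) / len(seconds))

    events, run_start = [], None
    for i, level in enumerate(seconds + [0.0]):   # sentinel flushes last run
        if level > threshold:
            if run_start is None:
                run_start = i
        else:
            if run_start is not None and i - run_start >= min_run:
                events.append((run_start, i))     # [start, end) in seconds
            run_start = None
    return events
```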
Exciting event detection was executed on three movies: Night at the Museum 2, The Kingdom, and The Legend of Butch and Sundance. Some detection result graphs for the three movies follow.