Research Article
Automatic Sound Scene Control Using Image Sensor Network
Changhee Cho,1 Jaehyung Park,2 and Kwangki Kim3
1 Graduate School of Interdisciplinary Program of E-Commerce, Chonnam National University, Yongbong-dong, Buk-gu,
Gwangju 500-757, Republic of Korea
2 School of Electronics and Computer Engineering, Chonnam National University, Yongbong-dong, Buk-gu,
Gwangju 500-757, Republic of Korea
3 Department of Digital Contents, Korea Nazarene University, Cheonan 331-718, Republic of Korea
Correspondence should be addressed to Jaehyung Park; hyeoung@chonnam.ac.kr and Kwangki Kim; k2kim@kornu.ac.kr
Received 28 February 2014; Accepted 14 April 2014; Published 5 May 2014
http://dx.doi.org/10.1155/2014/621805
Academic Editor: Carlos Ramos
Copyright © 2014 Changhee Cho et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
We propose an automatic sound scene control system that uses an image sensor network to preserve a constant sound scene regardless of the users' movement. In the proposed system, the image sensor network detects the human location in the multichannel playback environment, and the SSC (sound scene control) module automatically controls the sound scene of the multichannel audio signals according to the estimated human location, expressed as angle information. To estimate the direction of the human face, we use the normalized RGB (red, green, and blue) and the HSV (hue, saturation, and value) values calculated from the images obtained by the image sensor network. The direction of the human face can then simply be taken as the location of the image sensor whose captured image contains the highest number of pixels satisfying the normalized RGB and HSV thresholds. The estimated direction of the human face is fed directly to the SSC module, and the controlled sound scene is then easily generated. Experimental results show that the image sensor network detected the human location with an accuracy of about 98% and that the sound scene controlled by the SSC according to the detected human location was perceived as the original sound scene with an accuracy of 95%.
1. Introduction
With the increase in multichannel audio sources such as DVD and consumers' demand for more realistic audio services, multichannel audio signals are becoming more important in audio coding and audio services. Because multichannel audio signals require a very high bit-rate for transmission, there have been many efforts to handle them efficiently with respect to bit-rate and sound quality, and spatial cue based multichannel audio coding schemes such as binaural cue coding (BCC), MPEG Surround, and sound source location coefficient coding have been introduced and developed [1–6]. These multichannel audio coding schemes do not attempt to provide an approximate reconstruction of the original multichannel signals' waveforms; instead, they focus on delivering a perceptually satisfying replica of the original sound scene by exploiting knowledge about human perception [7–9]. In spatial cue based multichannel audio coding, the spatial image of the multichannel audio signals is captured by a compact set of parameters, that is, the spatial cues, and a down-mix signal. In other words, the multichannel audio signals are represented as a down-mix signal and a small amount of side information while the sound image of the multichannel audio signals is successfully preserved. Accordingly, spatial cue based multichannel audio coding can dramatically reduce the bit-rate and provide an extremely efficient representation of the multichannel audio signals.
Apart from the coding efficiency and the sound quality of spatial cue based multichannel audio coding, which are determined by the spatial cues, the spatial cues offer another merit: they can be used to create valuable functionality in multichannel audio coding. Since the spatial image of the multichannel audio signals is preserved by the spatial
cues, we can control the sound scene by modifying the spatial cues. In other words, the spatial cues can be utilized not only to maintain the sound quality of the multichannel audio signals but also to change their sound scene. We call this sound scene control (SSC) based on the spatial cues, and this functionality can provide users with interactivity [10]. Moreover, the SSC can be implemented in the frequency domain and needs only a few multiplications and additions, so the complexity of spatial cue based multichannel audio coding is rarely affected by the SSC [10]. One possible application of the SSC is to combine the sound scene controller with multiview video. For multiview broadcasting, which is expected in the near future, multichannel sound scene control can provide interactive audio playback and realistic audio sound by synchronizing the sound scene with the moving video scene.
Meanwhile, users perceive the original sound scene of the multichannel audio signals produced by content providers only when they are located at the center of the multichannel speaker layout. If users change their position, especially the direction of their heads, they perceive a sound scene different from the original one due to the binaural effect [3, 11]. In other words, when the users' position changes, the original sound scene should be controlled according to the new position so that the users perceive a constant sound scene regardless of their movement. To achieve this goal, we propose an automatic sound scene control system using an image sensor network.
In the proposed system, the human location (or the direction of the human face) is detected by the image sensor network, and the sound scene of the multichannel audio signals is automatically controlled by the previously mentioned SSC module according to the estimated human location, expressed as angle information. The image sensor network consists of twelve image sensors uniformly arranged in the multichannel playback environment, giving a 30-degree resolution for detecting the direction of the human face. To estimate the direction of the human face, we use the normalized RGB (red, green, and blue) and the HSV (hue, saturation, and value) values calculated from the images obtained by the image sensor network [12–14]. This is because the normalized RGB and the HSV are useful for detecting human skin regions in images. Moreover, since the image captured by the sensor located in the direction of the human face contains many pixels whose normalized RGB and HSV values satisfy the thresholds for detecting human skin, the direction of the human face can simply be taken as the location of the image sensor whose captured image contains the highest number of such pixels. The estimated direction of the human face is fed directly to the SSC module, and the controlled sound scene is then easily generated.
The paper is organized as follows. In Section 2, the sound scene control method in MPEG Surround, a representative multichannel audio coder, is presented. In Section 3, the estimation of the direction of the human face using the image sensor network and the proposed automatic sound scene control system are described. In Sections 4 and 5, experimental results and conclusions are given, respectively.
2. Sound Scene Control in MPEG Surround
MPEG Surround is a technology that represents multichannel audio signals as a down-mix signal and spatial cues [1–3]. MPEG Surround uses only the down-mix signal and the additional side information, that is, the spatial cues, for the transmission of the multichannel audio signals through a wired/wireless network. Therefore, users can enjoy realistic audio through multichannel audio services such as digital audio broadcasting and digital multimedia broadcasting in wired/wireless network environments.
MPEG Surround uses the channel level difference (CLD) and the interchannel correlation (ICC) as spatial cues. The CLD is the main parameter in MPEG Surround because it determines the spectral power of the reconstructed multichannel audio signals and occupies a considerable amount of the side information [4]. The ICC, on the other hand, is an ancillary parameter because it reflects the spatial diffuseness of the recovered multichannel audio signals and takes up only a small portion of the side information. As the multichannel audio sound is compressed and recovered using the down-mix signal and the spatial parameters, the performance of MPEG Surround in terms of coding efficiency and sound quality is determined by the spatial parameters. In other words, the sound image formed by the multichannel audio signals is captured and recovered by the CLD and the ICC. With this knowledge, we can alternatively control the sound scene of the multichannel audio signals by modifying the spatial parameters. The SSC is a tool that reproduces a new sound scene of the multichannel audio signals according to a global panning position freely input by a user or another system. Given the panning angle, denoted by $\theta_{\text{pan}}$, the multichannel audio signals are rotated by $\theta_{\text{pan}}$. To control the sound scene, the SSC modifies the spatial cues according to the input sound scene information and then generates the modified spatial cues. Finally, the MPEG Surround decoder generates the multichannel audio signals with the controlled sound scene using the modified spatial cues. Figure 1 shows the structure of MPEG Surround with the SSC.
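For concreteness, the CLD of one subband is essentially the log ratio of the two channels' subband powers. The following minimal Python sketch (our illustration; the function name and the eps guard are ours, not part of the MPEG Surround specification) computes it from two subband signals:

```python
import numpy as np

def extract_cld(ch1_subband, ch2_subband, eps=1e-12):
    """Estimate the channel level difference (CLD, in dB) of one
    subband as the log ratio of the two channels' subband powers."""
    p1 = np.sum(np.abs(ch1_subband) ** 2)
    p2 = np.sum(np.abs(ch2_subband) ** 2)
    return 10.0 * np.log10((p1 + eps) / (p2 + eps))
```

A positive CLD then means the first channel is louder than the second in that subband.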
The procedure of the SSC in MPEG Surround is shown in Figure 2. First, the spatial parameters, namely the CLD and the ICC, are parsed from the transmitted spatial parameter bit-stream. They are then modified according to $\theta_{\text{pan}}$, which is the sound scene information. Here, the CLD and the ICC are controlled separately, and it is assumed that $\theta_{\text{pan}}$ is fed to each modification module. The modified spatial parameters are formatted again, and finally the modified spatial parameter bit-stream is generated.
To modify the CLD, it is processed by the gain factor converter, constant power panning (CPP), and the CLD converter, sequentially. In the gain factor converter, the CLD is converted to a channel level gain per subband.
Figure 1: MPEG Surround with sound scene control module (the MPEG Surround encoder turns the multichannel input into a down-mix bit-stream and a spatial cue bit-stream; the sound scene control module converts the spatial cue bit-stream into a modified spatial cue bit-stream according to the sound scene information; the down-mix decoder and the MPEG Surround decoder then produce the reconstructed multichannel output).
Figure 2: Procedure of sound scene control (the bit-stream deformatter parses the spatial cue bit-stream into the CLD and the ICC; the CLD passes through the gain factor converter, constant power panning, and the CLD converter, while the ICC passes through the ICC modification; the bit-stream formatter then produces the modified spatial cue bit-stream).
The gain factors are simply calculated from the CLD as follows:

$$G_b^i = \frac{1}{\sqrt{1 + 10^{\mathrm{CLD}_b/10}}}, \qquad G_b^{i+1} = G_b^i \cdot 10^{\mathrm{CLD}_b/20}, \tag{1}$$

where $G_b^i$ is the gain factor, the superscript index $i$ is the channel index, and the subscript index $b$ is the subband index. One CLD per subband thus provides the power gains of two channels. This gain conversion is applied to all CLDs, and each channel level gain is easily obtained by multiplying all gain factors related to that channel.
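A direct transcription of (1) in Python (a minimal sketch; the function name and array handling are ours) shows that the two gains are power-complementary:

```python
import numpy as np

def cld_to_gains(cld_db):
    """Convert per-subband CLD values (dB) into the two channel gain
    factors of (1); by construction G_i**2 + G_{i+1}**2 == 1."""
    cld_db = np.asarray(cld_db, dtype=float)
    g_i = 1.0 / np.sqrt(1.0 + 10.0 ** (cld_db / 10.0))
    g_i_plus_1 = g_i * 10.0 ** (cld_db / 20.0)
    return g_i, g_i_plus_1
```

For example, a CLD of 0 dB yields $G_b^i = G_b^{i+1} = 1/\sqrt{2}$, that is, equal power in both channels.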
In the CPP module, the CPP law is applied to manipulate the position of each channel according to the desired sound scene [15, 16]. Assume that the channel gain $G_b^i$ is to be positioned at $\theta_{\text{pan}}$, located between the left front (Lf) and the left surround (Ls) channels as shown in Figure 3. Then $G_b^i$ is projected onto the Lf and Ls channels as follows:

$$\theta_m = \frac{\theta_{\text{pan}} - \theta_1}{\text{aperture} - \theta_1} \cdot \frac{\pi}{2}, \qquad G_{b,\text{new}}^{\text{Lf}} = G_b^{\text{Lf}} + \cos(\theta_m)\, G_b^i, \qquad G_{b,\text{new}}^{\text{Ls}} = G_b^{\text{Ls}} + \sin(\theta_m)\, G_b^i, \tag{2}$$

where $\theta_m$ is the normalized angle limited to 90 degrees and the aperture is the angle between the two channels.

Figure 3: An example of the constant power panning law between two channels ($\theta_1$, $\theta_m$, and the aperture are indicated).

In the same manner, any other channel gain can be flexibly handled to form the desired sound scene. After the CPP processing, the modified CLDs are newly estimated from all new channel gains in the CLD converter. Here, the CLD converter is exactly the same as the CLD extractor of the MPEG Surround encoder. If the CLD is estimated between the Lf and Ls channels, the modified CLD is calculated as follows:
$$\mathrm{CLD}_{b,\text{new}}^{\text{Lf,Ls}} = 10 \log_{10} \left( \frac{G_{b,\text{new}}^{\text{Ls}}}{G_{b,\text{new}}^{\text{Lf}}} \right)^2. \tag{3}$$
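As an illustration, (2) and (3) can be combined into a single panning step. The following sketch reflects our reading of the equations; the function name and the interface are ours:

```python
import numpy as np

def cpp_pan(theta_pan, theta_1, aperture, g_src, g_lf, g_ls):
    """Project the channel gain g_src onto the Lf/Ls pair by constant
    power panning, following (2), and recompute the CLD as in (3).
    The angle ratio is dimensionless, so the input angles may be in
    degrees or radians; theta_m itself comes out in radians."""
    theta_m = (theta_pan - theta_1) / (aperture - theta_1) * (np.pi / 2.0)
    g_lf_new = g_lf + np.cos(theta_m) * g_src
    g_ls_new = g_ls + np.sin(theta_m) * g_src
    cld_new = 10.0 * np.log10((g_ls_new / g_lf_new) ** 2)
    return g_lf_new, g_ls_new, cld_new
```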
To modify the ICC perfectly, the ICC would have to be reestimated according to the controlled sound scene. However, unlike the CLD, the ICC cannot be reestimated in the parameter domain, since the degree of correlation between the channels can only be estimated in the signal domain. Due to this problem, the ICC cannot be perfectly controlled, which could degrade the overall sound quality after changing the sound scene. In spite of this restriction, two kinds of ICC parameters can be modified in the case of sound scene rotation:

$$\mathrm{ICC}_{\text{Ls,Lf}}^{\text{new}} = (1 - \eta)\, \mathrm{ICC}_{\text{Ls,Lf}} + \eta\, \mathrm{ICC}_{\text{Rs,Rf}}, \qquad \mathrm{ICC}_{\text{Rs,Rf}}^{\text{new}} = (1 - \eta)\, \mathrm{ICC}_{\text{Rs,Rf}} + \eta\, \mathrm{ICC}_{\text{Ls,Lf}}, \tag{4}$$

where $\eta$ is given by

$$\eta = \begin{cases} \dfrac{\theta_{\text{pan}}}{\pi}, & \theta_{\text{pan}} \le \pi, \\[2mm] 1 - \dfrac{\theta_{\text{pan}} - \pi}{\pi}, & \theta_{\text{pan}} > \pi. \end{cases} \tag{5}$$

These equations mean that the left and right half-plane ICC parameters are completely cross-exchanged when the scene rotation is 180 degrees. As the rotation angle increases beyond 180 degrees toward 360 degrees, the cross-exchange is reversed, and the modified ICC parameters become equal to the original ones at 360 degrees. This modification concept originates from the common smoothing technique used in MPEG Surround [3, 4].
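The ICC cross-fade of (4) and (5) is only a few lines of code. A minimal sketch (the function name is ours, and the panning angle is taken in radians):

```python
import numpy as np

def modify_icc(icc_ls_lf, icc_rs_rf, theta_pan):
    """Cross-fade the left and right half-plane ICCs for a scene
    rotation of theta_pan in [0, 2*pi], following (4) and (5): a full
    swap at pi (180 degrees), back to the originals at 2*pi."""
    if theta_pan <= np.pi:
        eta = theta_pan / np.pi
    else:
        eta = 1.0 - (theta_pan - np.pi) / np.pi
    new_ls_lf = (1.0 - eta) * icc_ls_lf + eta * icc_rs_rf
    new_rs_rf = (1.0 - eta) * icc_rs_rf + eta * icc_ls_lf
    return new_ls_lf, new_rs_rf
```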
3. The Proposed Automatic Sound Scene Control System Using Image Sensor Network
As the human perception of the sound scene is determined by the human location in the multichannel playback environment, the image sensors are placed around the multichannel speaker layout as shown in Figure 4. Figure 4 shows the twelve image sensors uniformly distributed in the multichannel playback environment, giving a human localization resolution of 30 degrees.
Although the direction of the human face is an important factor for the perception of the sound scene, precise recognition of the human face is not necessary in our proposed system. Since the image captured by the sensor located in the direction of the human face contains many pixels whose normalized RGB and HSV values satisfy the thresholds for detecting human skin, the direction of the human face can simply be taken as the location of the image sensor whose captured image contains the highest number of such pixels.
Past research has confirmed that human skin colors cluster in a small region of the RGB color space and that human skin colors differ more in brightness than in color [12–14]. Therefore, the normalized RGB values can be used to detect human faces with less variance in color. Generally, the color of each pixel in an image is represented by the combination of the $R$, $G$, and $B$ components, and the brightness value is calculated as

$$\text{brightness} = \frac{R + G + B}{3}, \tag{6}$$

where the range of each RGB component is 0 to 255.

Figure 4: Image sensor network in the multichannel playback environment.

As the color information is very sensitive to the brightness of the pixel, the RGB components are normalized as

$$r = \frac{R}{R+G+B}, \qquad g = \frac{G}{R+G+B}, \qquad b = \frac{B}{R+G+B}, \tag{7}$$

where the sum of $r$, $g$, and $b$ is 1. Thus, the normalized color values can be expressed with only $r$ and $g$.
In addition to the normalized RGB values, we use the HSV (hue, saturation, and value) as additional parameters for recognizing the direction of the human face [12]. This is because the HSV is closer to the human perception of color. First, the hue ($H$) is a measure of the spectral composition and is represented as an angle varying from 0 to 360 degrees. Second, the saturation ($S$) is the purity of a color and varies from 0 to 1. Finally, the value ($V$) is defined as the darkness of a color and also ranges from 0 to 1. The HSV values can be simply calculated from the RGB values using the following equations:

$$H_1 = \cos^{-1} \left\{ \frac{0.5\left[(R-G) + (R-B)\right]}{\sqrt{(R-G)^2 + (R-B)(G-B)}} \right\},$$
$$H = \begin{cases} H_1, & \text{if } B \le G, \\ 360 - H_1, & \text{if } B > G, \end{cases}$$
$$S = \frac{\max(R,G,B) - \min(R,G,B)}{\max(R,G,B)}, \qquad V = \frac{\max(R,G,B)}{255}. \tag{8}$$
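Equations (7) and (8) translate directly into code. The following sketch (helper names are ours) converts one 8-bit RGB pixel into the normalized RGB and HSV features used below:

```python
import numpy as np

def normalized_rgb(r8, g8, b8):
    """Normalized RGB of (7): each component divided by R + G + B."""
    total = float(r8) + float(g8) + float(b8)
    if total == 0.0:
        return 0.0, 0.0, 0.0
    return r8 / total, g8 / total, b8 / total

def rgb_to_hsv(r8, g8, b8):
    """HSV of (8): H in degrees (0..360), S and V in 0..1."""
    r, g, b = float(r8), float(g8), float(b8)
    mx, mn = max(r, g, b), min(r, g, b)
    num = 0.5 * ((r - g) + (r - b))
    den = np.sqrt((r - g) ** 2 + (r - b) * (g - b))
    # Clip guards against tiny numerical overshoot outside [-1, 1].
    h1 = np.degrees(np.arccos(np.clip(num / den, -1.0, 1.0))) if den > 0 else 0.0
    h = h1 if b <= g else 360.0 - h1
    s = (mx - mn) / mx if mx > 0 else 0.0
    v = mx / 255.0
    return h, s, v
```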
Figure 5: Procedure of the human localization using the image sensor network (RGB reading from the images obtained by the sensors, normalized RGB and HSV calculation, skin-like pixel counting, and decision of the human localization, outputting the location of the decided image sensor as an angle).
We used the following threshold values of the normalized RGB and the HSV for deciding whether a pixel is human skin-like [12]:

$$0.36 \le r \le 0.465, \qquad 0.28 \le g \le 0.363,$$
$$0 \le H \le 50, \qquad 0.20 \le S \le 0.68, \qquad 0.35 \le V \le 1.0. \tag{9}$$
A pixel of the image obtained by a sensor is judged to be a human skin-like pixel only if its normalized RGB and HSV values satisfy all the thresholds in (9). Accordingly, all skin-like pixels in each captured image are counted, and the location of the image sensor with the highest number of skin-like pixels is taken as the direction of the human face. Figure 5 shows the whole procedure of the human localization using the image sensor network.
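A sketch of the whole decision rule, reusing normalized_rgb and rgb_to_hsv from the previous sketch (names are ours; the per-pixel Python loop would need vectorizing for real-time use):

```python
import numpy as np

def is_skin(r8, g8, b8):
    """Apply the thresholds of (9) to one 8-bit RGB pixel."""
    r, g, _ = normalized_rgb(r8, g8, b8)
    h, s, v = rgb_to_hsv(r8, g8, b8)
    return (0.36 <= r <= 0.465 and 0.28 <= g <= 0.363
            and 0.0 <= h <= 50.0 and 0.20 <= s <= 0.68
            and 0.35 <= v <= 1.0)

def face_direction(images, sensor_angles):
    """Return the angle of the sensor whose image contains the most
    skin-like pixels; 'images' holds one HxWx3 uint8 array per sensor."""
    counts = [sum(is_skin(int(p[0]), int(p[1]), int(p[2]))
                  for p in img.reshape(-1, 3)) for img in images]
    return sensor_angles[int(np.argmax(counts))]
```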
The proposed automatic sound scene control system using the image sensor network is shown in Figure 6. Compared to Figure 1, the input angle to the sound scene control module is simply replaced by the direction of the human face estimated from the human movement. Therefore, the sound scene control module explained in Section 2 can be used directly in the proposed system without any changes in its operation. The proposed system has a maximum sound scene control error of 15 degrees, since the image sensor network has a 30-degree resolution for estimating the direction of the human face.
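A hypothetical glue routine (purely illustrative; only the ICC path from Section 2 is wired up here, and all names are ours) shows how the estimated angle would drive the SSC:

```python
import numpy as np

SENSOR_ANGLES = list(range(0, 360, 30))  # twelve sensors, 30-degree grid

def control_step(images, icc_ls_lf, icc_rs_rf):
    """One control cycle: localize the face, then rotate the scene.
    The worst-case error is 15 degrees, half the sensor spacing."""
    theta_pan_deg = face_direction(images, SENSOR_ANGLES)
    return modify_icc(icc_ls_lf, icc_rs_rf, np.radians(theta_pan_deg))
```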
Table 1: Test items.

Table 2: Recognition rate of the image sensor network. Columns: position (degree), true recognition, false recognition (recognized angle), and recognition rate (%).
4. Experimental Results
To validate the performance of the proposed automatic sound scene control system using the image sensor network, we performed a subjective listening test focused on the sensing ability of the image sensor network and the controllability of the SSC based on the output of the image sensor network. For the listening test, the five test items offered by the MPEG audio subgroup were used; they are listed in Table 1 [17]. The items were sampled at 44.1 kHz with 16-bit resolution and were all shorter than 20 seconds. Eight listeners participated in the listening test.
To check the sensing ability of the image sensor network, the result estimated by the image sensor network was compared to the listeners' position, that is, the direction of their face, as they moved. For clarity of the test, only 30, 60, 90, 120, 150, 180, 210, 240, 270, 300, and 330 degrees were allowed as the listeners' positions. All listeners changed their position to each of the given angles 3 times, and the total number of trials was 264. Table 2 and Figure 7 show the recognition result of the image sensor network. The recognition rate of the image sensor network is about 98.1%, and only 5 trials were recognized as a wrong position. The main reason for the false recognitions was an incorrect head direction of the listeners.
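As a quick arithmetic check (ours), the trial count and the rate follow directly from the setup of eleven positions, eight listeners, and three repetitions:

$$11 \times 8 \times 3 = 264, \qquad \frac{264 - 5}{264} \times 100\% \approx 98.1\%.$$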
To check the controllability of the SSC, we used two kinds of audio sounds: the original and the controlled ones. The original sound scene was given as the reference signal, and the listeners decided whether the controlled sound scene corresponding to their position was equal to the original sound scene as they moved.
Figure 6: The proposed automatic sound scene control system using the image sensor network (the image sensor network captures images of the moving listener; the human localization module outputs the direction of the human face as an angle; this angle is fed as the sound scene information to the sound scene control module, and the MPEG Surround decoder reconstructs the multichannel output with the controlled sound scene, which is played back over the multichannel playback system as realistic audio sound).
Figure 7: Recognition rate (%) of the image sensor network at each position (30 to 330 degrees) and in total.
For simplification of the test, 60, 120, 180, 240, and 300 degrees were used as the listeners' positions. All listeners changed their position to each of the given angles for each test item, and the total number of trials was 200. If the estimated listeners' position was wrong, the trial was discarded and the listener tried again at the same position. Table 3 and Figure 8 show the result of checking the controllability of the SSC. The ratio at which the sound scene controlled by the SSC was perceived as the original sound scene was 95%. Because the SSC has the previously described problem with the ICC modification, the audio sound newly generated by the SSC exhibited a different sound scene in some trials.
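Again as a check (ours): five positions, eight listeners, and five items give $5 \times 8 \times 5 = 200$ trials, and a 95% rate corresponds to $0.95 \times 200 = 190$ trials perceived as the original scene.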
5. Conclusion
In this paper, we proposed an automatic sound scene control system using an image sensor network for preserving a constant sound scene regardless of the users' movement.
Table 3: Controllability result of the proposed system using the image sensor network. Columns: position (degree), same sound scene, different sound scene, and accuracy (%).
Figure 8: Controllability result (%) of the proposed system using the image sensor network at each position.
In the proposed system, the image sensor network detects the human location in the multichannel playback environment, and the SSC module automatically controls the sound scene of the multichannel audio signals according to the estimated human location, expressed as angle information.
To estimate the direction of the human face, we used the normalized RGB and the HSV values calculated from the images obtained by the image sensor network. The direction of the human face can simply be determined as the location of the image sensor whose captured image contains the highest number of pixels satisfying the normalized RGB and HSV thresholds. The estimated direction of the human face is fed directly to the SSC module, and the controlled sound scene is then easily generated.
Experimental results show that the image sensor network can successfully detect the human location with an accuracy of about 98%. Moreover, the sound scene controlled by the SSC according to the detected human location was perceived as the original sound scene with an accuracy of 95%. To enhance the performance of the image sensor network and the SSC, more precise human localization using eye detection in the images remains as future work.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
This study was funded by the research fund of Korea Nazarene University in 2014 (Kwangki Kim) and supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science, and Technology (Grant no. 2012R1A1A4A01004195).
References
[1] C. Faller and F. Baumgarte, "Binaural cue coding—part II: schemes and applications," IEEE Transactions on Speech and Audio Processing, vol. 11, no. 6, pp. 520–531, 2003.
[2] F. Baumgarte and C. Faller, "Binaural cue coding—part I: psychoacoustic fundamentals and design principles," IEEE Transactions on Speech and Audio Processing, vol. 11, no. 6, pp. 509–519, 2003.
[3] J. Herre, H. Purnhagen, and J. Breebaart, "The reference model architecture for MPEG spatial audio coding," in Proceedings of the 118th AES Convention, Barcelona, Spain, 2005.
[4] ISO/IEC 23003-1, "Information Technology—MPEG Audio Technologies—Part 1: MPEG Surround," 2007.
[5] H.-G. Moon, J.-I. Seo, S. Baek, and K.-M. Sung, "A multichannel audio compression method with virtual source location information for MPEG-4 SAC," IEEE Transactions on Consumer Electronics, vol. 51, no. 4, pp. 1253–1259, 2005.
[6] S. Beack, J. Seo, H. Moon, K. Kang, and M. Hahn, "Angle-based virtual source location representation for spatial audio coding," ETRI Journal, vol. 28, no. 2, pp. 219–222, 2006.
[7] D. A. Burgess, "Techniques for low cost spatial audio," in Proceedings of UIST, 1992.
[8] S. H. Foster, E. M. Wenzel, and R. M. Taylor, Real-Time Synthesis of Complex Acoustic Environments, Crystal River Engineering, Groveland, Calif, USA.
[9] J. Blauert, Spatial Hearing: The Psychophysics of Human Sound Localization, MIT Press, Cambridge, Mass, USA, 1983.
[10] K. Kim, "Sound scene control of multi-channel audio signals for realistic audio service in wired/wireless network," International Journal of Multimedia and Ubiquitous Engineering, vol. 9, no. 2, 2014.
[11] E. Zwicker and H. Fastl, Psychoacoustics, Springer, Berlin, Germany, 1999.
[12] Y. Wang and B. Yuan, "A novel approach for human face detection from color images under complex background," Pattern Recognition, vol. 34, no. 10, pp. 1983–1992, 2001.
[13] S.-H. Kim and H.-G. Kim, "Facial region detection using range color information," IEICE Transactions on Information and Systems, vol. 81, no. 9, pp. 968–975, 1998.
[14] J. Yang and A. Waibel, "Tracking human faces in real time," Tech. Rep. CMU-CS-95-210, Carnegie Mellon University, 1995.
[15] V. Pulkki, "Virtual sound source positioning using vector base amplitude panning," Journal of the Audio Engineering Society, vol. 45, no. 6, pp. 456–465, 1997.
[16] M. A. Gerzon, "Panpot laws for multispeaker stereo," in Proceedings of the 92nd Convention of the AES, 1992.
[17] ISO/IEC JTC1/SC29/WG11 (MPEG), "Procedures for the Evaluation of Spatial Audio Coding Systems," Document N6691, Redmond, Wash, USA, 2004.