
Research Article

Automatic Sound Scene Control Using Image Sensor Network

Changhee Cho,1 Jaehyung Park,2 and Kwangki Kim3

1 Graduate School of Interdisciplinary Program of E-Commerce, Chonnam National University, Yongbong-dong, Buk-gu, Gwangju 500-757, Republic of Korea
2 School of Electronics and Computer Engineering, Chonnam National University, Yongbong-dong, Buk-gu, Gwangju 500-757, Republic of Korea
3 Department of Digital Contents, Korea Nazarene University, Cheonan 331-718, Republic of Korea

Correspondence should be addressed to Jaehyung Park; hyeoung@chonnam.ac.kr and Kwangki Kim; k2kim@kornu.ac.kr

Received 28 February 2014; Accepted 14 April 2014; Published 5 May 2014

http://dx.doi.org/10.1155/2014/621805

Academic Editor: Carlos Ramos

Copyright © 2014 Changhee Cho et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

We propose an automatic sound scene control system that uses an image sensor network to preserve a constant sound scene regardless of the users' movement. In the proposed system, the image sensor network detects the human location in the multichannel playback environment, and the SSC (sound scene control) module automatically controls the sound scene of the multichannel audio signals according to the estimated human location, which is given as angle information. To estimate the direction of the human face, we used the normalized RGB (red, green, and blue) and the HSV (hue, saturation, and value) values calculated from the images obtained by the image sensor network. The direction of the human face is simply decided as the location of the image sensor whose image contains the highest number of pixels satisfying the normalized RGB and HSV thresholds. The estimated direction of the human face is directly fed to the SSC module, and the controlled sound scene can then be generated. Experimental results show that the image sensor network detected the human location with an accuracy of about 98%, and the sound scene controlled by the SSC according to the detected human location was perceived as the original sound scene with an accuracy of 95%.

1. Introduction

With the increase in multichannel audio sources such as DVD and consumers' demand for more realistic audio services, multichannel audio signals are becoming more important in audio coding and audio services. In addition, as multichannel audio signals need a very high bit-rate to be transmitted, there have been many efforts to efficiently handle them with respect to bit-rate and sound quality, and spatial cue based multichannel audio coding schemes such as binaural cue coding (BCC), MPEG Surround, and sound source location coefficient coding have been introduced and developed [1–6]. These multichannel audio coding schemes do not attempt to provide an approximate reconstruction of the original multichannel signals' waveforms; instead, they focus on delivering a perceptually satisfying replica of the original sound scene by exploiting knowledge about human perception [7–9]. In spatial cue based multichannel audio coding, the spatial image of the multichannel audio signals is captured by a compact set of parameters, that is, the spatial cues, and a down-mix signal. In other words, the multichannel audio signals are represented as a down-mix signal and a small amount of side information while successfully preserving the sound image of the multichannel audio signals. Accordingly, spatial cue based multichannel audio coding can dramatically reduce the bit-rate and provide an extremely efficient representation of the multichannel audio signals.

Apart from the coding efficiency and the sound quality of spatial cue based multichannel audio coding, which are determined by the spatial cues, the spatial cues offer another merit: they can be used to create valuable functionality in multichannel audio coding. Since the spatial image of the multichannel audio signals can be preserved by the spatial cues, we can control the sound scene by modifying the spatial cues. In other words, the spatial cues can be utilized not only to keep the sound quality of the multichannel audio signals but also to change their sound scene. We call this sound scene control (SSC) based on the spatial cues, and this functionality can provide users with interactivity [10]. Moreover, the SSC can be implemented in the frequency domain and only needs a few multiplications and additions, so the complexity of spatial cue based multichannel audio coding is rarely affected by the SSC [10]. One possible application of the SSC is to combine the sound scene controller with multiview video. In the case of multiview broadcasting, which is expected to arrive in the near future, multichannel sound scene control can provide interactive audio playback systems and realistic audio sound by synchronizing the sound scene with the moving video scene.

Meanwhile, users perceive the original sound scene of the multichannel audio signals produced by content providers only when they are located at the center of the multichannel speaker layout. If users change their position, especially the direction of their heads, they perceive a sound scene different from the original one due to the binaural effect [3, 11]. In other words, when the users' position has changed, the original sound scene should be controlled according to the users' new position so that the users can perceive a constant sound scene regardless of their movement. To achieve this goal, we propose an automatic sound scene control system using an image sensor network.

In the proposed system, the human location (or the direction of the human face) is detected by the image sensor network, and the sound scene of the multichannel audio signals is automatically controlled by the previously mentioned SSC module according to the estimated human location, which is given as angle information. The image sensor network consists of twelve image sensors uniformly arranged in the multichannel playback environment and has a 30-degree resolution for detecting the direction of the human face. To estimate the direction of the human face, we used the normalized RGB (red, green, and blue) and the HSV (hue, saturation, and value) values calculated from the images obtained by the image sensor network [12–14], because the normalized RGB and the HSV are useful for detecting human skin regions in images. In addition, since the image obtained by the sensor located in the direction of the human face contains many pixels whose normalized RGB and HSV values satisfy the skin-detection thresholds, the direction of the human face can simply be decided as the location of the image sensor that captures the image with the highest number of such pixels. The estimated direction of the human face is directly fed to the SSC module, and the controlled sound scene can then be generated.

The paper is organized as follows. In Section 2, the sound scene control method in MPEG Surround, which is representative of multichannel audio coders, is presented. In Section 3, the estimation of the direction of the human face using the image sensor network and the proposed automatic sound scene control system are described. In Sections 4 and 5, experimental results and conclusions are given, respectively.

2. Sound Scene Control in MPEG Surround

MPEG Surround is a technology that represents multichannel audio signals as a down-mix signal and spatial cues [1–3]. MPEG Surround only uses the down-mix signal and the additional side information, that is, the spatial cues, for the transmission of the multichannel audio signals through wired/wireless network systems. Therefore, users can enjoy realistic audio sound from multichannel audio signals through multichannel audio services such as digital audio broadcasting and digital multimedia broadcasting in wired/wireless network environments.

MPEG Surround uses the channel level difference (CLD) and the interchannel correlation (ICC) as spatial cues. The CLD is the main parameter in MPEG Surround because it determines the spectral power of the reconstructed multichannel audio signals and occupies a considerable amount of the side information [4]. In contrast, the ICC is an ancillary parameter because it reflects the spatial diffuseness of the recovered multichannel audio signals and takes up only a small portion of the side information. As the multichannel audio sound is compressed and recovered using the down-mix signal and the spatial parameters, the performance of MPEG Surround in terms of coding efficiency and sound quality is determined by the spatial parameters. In other words, the sound image formed by the multichannel audio signals is captured and recovered by the CLD and the ICC. Based on this knowledge, we can alternatively control the sound scene of the multichannel audio signals by modifying the spatial parameters. The SSC is a new tool that reproduces a new sound scene of the multichannel audio signals according to a global panning position that is freely input by a user or another system. Given the panning angle, denoted by $\theta_{\mathrm{pan}}$, the multichannel audio signals are rotated by $\theta_{\mathrm{pan}}$. To control the sound scene, the SSC modifies the spatial cues according to the input sound scene information and generates the modified spatial cues. Finally, the MPEG Surround decoder generates the multichannel audio signals with the controlled sound scene using the modified spatial cues. Figure 1 shows the structure of MPEG Surround with the SSC.

The procedure of the SSC in MPEG Surround is shown in Figure 2. First, the spatial parameters, namely, the CLD and the ICC, are parsed from the transmitted spatial parameter bit-stream. They are then modified according to $\theta_{\mathrm{pan}}$, which is the sound scene information. Here, the CLD and the ICC are controlled separately, and it is assumed that $\theta_{\mathrm{pan}}$ is fed to each modification module. The modified spatial parameters are formatted again, and finally the modified spatial parameter bit-stream is generated.

To modify the CLD, it is processed sequentially by the gain factor converter, constant power panning (CPP), and the CLD converter. In the gain factor converter, the CLD is converted into a channel level gain for each channel in each subband.

Figure 1: MPEG Surround with the sound scene control module (the sound scene control block modifies the spatial cue bitstream between the MPEG Surround encoder and decoder according to the sound scene information).

Figure 2: Procedure of sound scene control (spatial cue bitstream → bitstream deformatter; the CLD passes through the gain factor converter, constant power panning, and CLD converter, while the ICC passes through the ICC modification; the results are recombined by the bitstream formatter into the modified spatial cue bitstream).

The gain factors are simply calculated from the CLD by the following formula:
$$G_b^i = \frac{1}{\sqrt{1 + 10^{\mathrm{CLD}_b/10}}},\qquad G_b^{i+1} = G_b^i \cdot 10^{\mathrm{CLD}_b/20}, \tag{1}$$
where $G_b^i$ is the gain factor, the superscript index $i$ is the channel index, and the subscript index $b$ is the subband index. It is known that one CLD per subband provides the power gains of two channels. This gain conversion is applied to all CLDs, and each channel level gain can be easily obtained by multiplying all gain factors related to that channel.
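As a minimal illustration of (1), the following Python sketch (illustrative only; the function name and the example CLD value are not from the paper) converts a single CLD value into the two channel gain factors of one subband:

```python
import numpy as np

def cld_to_gains(cld_db):
    """Gain factors of equation (1) for one subband CLD value in dB."""
    g_i = 1.0 / np.sqrt(1.0 + 10.0 ** (cld_db / 10.0))  # first channel gain
    g_ip1 = g_i * 10.0 ** (cld_db / 20.0)                # second channel gain
    return g_i, g_ip1

# Example: a CLD of 6 dB makes the second channel 6 dB stronger than the first;
# the two gains are power complementary (their squares sum to 1).
g1, g2 = cld_to_gains(6.0)
print(g1, g2, g1**2 + g2**2)
```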

In the CPP module, the CPP law is applied to manipulate the position of each channel according to the desired sound scene [15, 16]. Suppose that the channel gain $G_b^i$ is to be positioned at $\theta_{\mathrm{pan}}$, located between the left front (Lf) and the left surround (Ls) channels as shown in Figure 3. Then $G_b^i$ is projected onto the Lf and Ls channels as follows:
$$\theta_m = \frac{\theta_{\mathrm{pan}} - \theta_1}{\mathrm{aperture} - \theta_1} \times \frac{\pi}{2},\qquad
G_{b,\mathrm{new}}^{\mathrm{Lf}} = G_b^{\mathrm{Lf}} + \cos(\theta_m)\cdot G_b^i,\qquad
G_{b,\mathrm{new}}^{\mathrm{Ls}} = G_b^{\mathrm{Ls}} + \sin(\theta_m)\cdot G_b^i, \tag{2}$$
where $\theta_m$ is the normalized angle limited to 90 degrees and the aperture is the angle between the two channels.

Figure 3: An example of the constant power panning law between two channels.

In the same manner, any other channel gain can be flexibly handled to form the desired sound scene. After the CPP processing, the modified CLDs are newly estimated from all the new channel gains in the CLD converter. Here, the CLD converter is exactly the same as the CLD extractor of the MPEG Surround encoder. If the CLD is estimated between the Lf and Ls channels, the modified CLD is calculated as
$$\mathrm{CLD}_{b,\mathrm{new}}^{\mathrm{Lf,Ls}} = 10\log_{10}\left(\frac{G_{b,\mathrm{new}}^{\mathrm{Ls}}}{G_{b,\mathrm{new}}^{\mathrm{Lf}}}\right)^{2}. \tag{3}$$
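A small Python sketch of (2) and (3), assuming the panning target lies between the Lf and Ls channels; the example values (an Lf channel at 30 degrees, an aperture of 120 degrees, and existing gains of 0.5) are illustrative assumptions, not values from the paper:

```python
import numpy as np

def cpp_pan(gain, theta_pan, theta_1, aperture):
    """Project a channel gain onto the Lf/Ls pair with the CPP law of (2).
    All angles are in degrees; aperture follows the notation of (2)."""
    theta_m = (theta_pan - theta_1) / (aperture - theta_1) * (np.pi / 2.0)
    return np.cos(theta_m) * gain, np.sin(theta_m) * gain   # contributions to Lf, Ls

def modified_cld(g_lf_new, g_ls_new):
    """Re-estimate the CLD between Lf and Ls from the new gains, as in (3)."""
    return 10.0 * np.log10((g_ls_new / g_lf_new) ** 2)

# Pan a unit gain to 75 degrees between an Lf channel at 30 degrees and an
# aperture of 120 degrees, then update the CLD of the Lf/Ls pair.
add_lf, add_ls = cpp_pan(1.0, theta_pan=75.0, theta_1=30.0, aperture=120.0)
g_lf_new, g_ls_new = 0.5 + add_lf, 0.5 + add_ls   # 0.5: assumed existing gains
print(modified_cld(g_lf_new, g_ls_new))
```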

To perfectly modify the ICC, the ICC should be reestimated according to the controlled sound scene. However, unlike the CLD, the ICC cannot be reestimated in the parameter domain, since the degree of correlation between channels can only be estimated in the signal domain. Because of this problem, the ICC cannot be perfectly controlled, which could result in a degradation of the overall sound quality after changing the sound scene. In spite of this restriction, two ICC parameters can be modified in the case of sound scene rotation:

$$\mathrm{ICC}'_{\mathrm{Ls,Lf}} = (1-\eta)\,\mathrm{ICC}_{\mathrm{Ls,Lf}} + \eta\,\mathrm{ICC}_{\mathrm{Rs,Rf}},\qquad
\mathrm{ICC}'_{\mathrm{Rs,Rf}} = (1-\eta)\,\mathrm{ICC}_{\mathrm{Rs,Rf}} + \eta\,\mathrm{ICC}_{\mathrm{Ls,Lf}}, \tag{4}$$
where $\eta$ is given by
$$\eta = \begin{cases} \dfrac{\theta_{\mathrm{pan}}}{\pi}, & \theta_{\mathrm{pan}} \le \pi,\\[4pt] 1 - \dfrac{\theta_{\mathrm{pan}} - \pi}{\pi}, & \theta_{\mathrm{pan}} > \pi. \end{cases} \tag{5}$$

These equations mean that the left and right half-plane ICC parameters are completely cross-changed when the scene rotation equals 180 degrees. When the rotation angle increases beyond 180 degrees toward 360 degrees, the reverse cross-change occurs, and the modified ICC parameters become equal to the original ones at 360 degrees. This concept of modification originates from the common smoothing technique used in MPEG Surround [3, 4].
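A brief Python sketch of the ICC cross-mixing in (4) and (5); the example ICC values are hypothetical:

```python
import numpy as np

def icc_mix_factor(theta_pan):
    """Cross-mixing factor eta of equation (5); theta_pan in radians (0..2*pi)."""
    if theta_pan <= np.pi:
        return theta_pan / np.pi
    return 1.0 - (theta_pan - np.pi) / np.pi

def modify_icc(icc_ls_lf, icc_rs_rf, theta_pan):
    """Cross-change the left and right half-plane ICCs as in equation (4)."""
    eta = icc_mix_factor(theta_pan)
    return ((1.0 - eta) * icc_ls_lf + eta * icc_rs_rf,
            (1.0 - eta) * icc_rs_rf + eta * icc_ls_lf)

print(modify_icc(0.9, 0.3, np.pi))        # 180 degrees: fully cross-changed -> (0.3, 0.9)
print(modify_icc(0.9, 0.3, 2.0 * np.pi))  # 360 degrees: back to the originals -> (0.9, 0.3)
```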

3. The Proposed Automatic Sound Scene Control System Using Image Sensor Network

As the human perception of the sound scene is determined by the human location in the multichannel playback environment, the image sensors are placed around the multichannel speaker layout as shown in Figure 4. Figure 4 shows that the twelve image sensors are uniformly distributed in the multichannel playback environment, so the resolution of the human localization is 30 degrees.

Although the direction of the human face is an important factor for the perception of the sound scene, precise recognition of the human face is not necessary in our proposed system. As the image obtained by the sensor located in the direction of the human face contains many pixels whose normalized RGB and HSV values satisfy the skin-detection thresholds, the direction of the human face can simply be decided as the location of the image sensor that captures the image with the highest number of such pixels. Past research has confirmed that human skin colors cluster in a small region of the RGB color space and differ more in brightness than in color [12–14]. Therefore, the normalized RGB values can be used to detect human faces with less variance in color.

Generally, the color of each pixel in an image is represented by the combination of its $R$, $G$, and $B$ components, each ranging from 0 to 255, and the brightness value is calculated from these components.

Figure 4: Image sensor network in the multichannel playback environment.

As the color information is very sensitive to the brightness of the pixel, the RGB components are normalized as
$$r = \frac{R}{R+G+B},\qquad g = \frac{G}{R+G+B},\qquad b = \frac{B}{R+G+B}, \tag{7}$$
where the sum of $r$, $g$, and $b$ is 1. Thus, the normalized color values can be expressed with only $r$ and $g$.

In addition to the normalized RGB values, we use the HSV (hue, saturation, and value) as an additional parameter for recognizing the direction of the human face [12], because the HSV representation is closer to the human perception of color. First, the hue ($H$) is a measure of the spectral composition and is represented as an angle varying from 0 to 360 degrees. Second, the saturation ($S$) is the purity of the color and varies from 0 to 1. Finally, the value ($V$) is defined as the darkness of the color and also ranges from 0 to 1. The HSV values can be calculated from the RGB values using the following equations:
$$H_1 = \cos^{-1}\left\{\frac{0.5\,[(R-G)+(R-B)]}{\sqrt{(R-G)^2+(R-B)(G-B)}}\right\},$$
$$H = \begin{cases} H_1, & \text{if } B \le G,\\ 360 - H_1, & \text{if } B > G, \end{cases}$$
$$S = \frac{\max(R,G,B)-\min(R,G,B)}{\max(R,G,B)},\qquad V = \frac{\max(R,G,B)}{255}. \tag{8}$$
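The color conversions of (7) and (8) can be sketched in Python as follows (the example pixel value is hypothetical and only meant to show typical skin-like outputs):

```python
import numpy as np

def normalized_rgb(R, G, B):
    """Normalized RGB of equation (7): r + g + b = 1."""
    s = float(R) + float(G) + float(B)
    return R / s, G / s, B / s

def rgb_to_hsv(R, G, B):
    """HSV of equation (8): H in degrees (0..360), S and V in 0..1."""
    R, G, B = float(R), float(G), float(B)
    num = 0.5 * ((R - G) + (R - B))
    den = np.sqrt((R - G) ** 2 + (R - B) * (G - B)) + 1e-12  # guard against division by zero
    h1 = np.degrees(np.arccos(np.clip(num / den, -1.0, 1.0)))
    H = h1 if B <= G else 360.0 - h1
    S = (max(R, G, B) - min(R, G, B)) / max(R, G, B)
    V = max(R, G, B) / 255.0
    return H, S, V

print(normalized_rgb(180, 120, 100))   # -> (0.45, 0.30, 0.25)
print(rgb_to_hsv(180, 120, 100))       # -> (about 13.9, 0.44, 0.71)
```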

Figure 5: Procedure of the human localization using the image sensor network (images obtained by the sensors → RGB reading → normalized RGB and HSV calculation → skin-like pixel counting → decision of the human localization → location of the decided image sensor as an angle).

We used the following threshold values of the normalized RGB and the HSV for the decision of human skin-like pixels [12]:
$$0.36 \le r \le 0.465,\qquad 0.28 \le g \le 0.363,$$
$$0 \le H \le 50,\qquad 0.20 \le S \le 0.68,\qquad 0.35 \le V \le 1.0. \tag{9}$$

A pixel of an image obtained by a sensor is judged to be a human skin-like pixel only if its normalized RGB and HSV values satisfy all of the thresholds in (9). Based on this, all skin-like pixels in each captured image are counted, and the location of the image sensor with the highest number of skin-like pixels is determined to be the direction of the human face. Figure 5 shows the whole procedure for the human localization using the image sensor network.
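Using the conversion helpers from the previous sketch, the threshold test of (9) and the localization decision of Figure 5 could look like the following Python sketch (the sensor ordering, the image format, and the function names are assumptions for illustration):

```python
import numpy as np

def is_skin_like(R, G, B):
    """Apply the normalized RGB and HSV thresholds of (9) to one pixel."""
    r, g, _ = normalized_rgb(R, G, B)          # helpers from the previous sketch
    H, S, V = rgb_to_hsv(R, G, B)
    return (0.36 <= r <= 0.465 and 0.28 <= g <= 0.363 and
            0 <= H <= 50 and 0.20 <= S <= 0.68 and 0.35 <= V <= 1.0)

def face_direction(sensor_images):
    """sensor_images: twelve HxWx3 uint8 arrays, assumed ordered so that
    sensor k sits at k * 30 degrees.  Returns the estimated face direction."""
    counts = [sum(is_skin_like(*px.astype(float)) for px in img.reshape(-1, 3))
              for img in sensor_images]
    best_sensor = int(np.argmax(counts))       # sensor with the most skin-like pixels
    return best_sensor * 30                    # angle fed to the SSC module
```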

The proposed automatic sound scene control system using the image sensor network is shown in Figure 6. Compared to Figure 1, the input angle to the sound scene control module is simply replaced by the direction of the human face estimated from the human movement. Therefore, the sound scene control module explained in Section 2 can be used directly in the proposed automatic sound scene control system without any changes in its operation. The proposed system has a sound scene control error of at most 15 degrees, since the image sensor network has a 30-degree resolution for estimating the direction of the human face.

Table 1: Test items.

Table 2: Recognition rate of the image sensor network (columns: position in degrees, true recognitions, false recognitions with the recognized angle, and recognition rate in %).

4. Experimental Results

To validate the performance of the proposed automatic sound scene control system using the image sensor network, we performed a subjective listening test that focused on the sensing ability of the image sensor network and on the controllability of the SSC based on the result of the image sensor network. For the listening test, the five test items offered by the MPEG audio subgroup were used; they are listed in Table 1 [17]. The items were sampled at 44.1 kHz with 16-bit resolution and were all shorter than 20 seconds. Eight listeners participated in the listening test.

To check the sensing ability of the image sensor network, the result estimated by the image sensor network was compared to the listeners' position, that is, the direction of their face, when they moved. Here, to simplify the test, only 30, 60, 90, 120, 150, 180, 210, 240, 270, 300, and 330 degrees were allowed as the listeners' positions. All listeners changed their position to each of the given angles 3 times, so the total number of trials was 264. Table 2 and Figure 7 show the recognition result of the image sensor network. The recognition rate of the image sensor network is about 98.1%, and only 5 trials were recognized as the wrong position. The main reason for the false recognitions was an incorrect head direction of the listener.

Figure 6: The proposed automatic sound scene control system using the image sensor network (MPEG Surround encoder and decoder with the sound scene control module driven by the direction of the human face estimated by the image sensor network from the captured images).

Figure 7: Recognition rate of the image sensor network (recognition rate in % versus listener position in degrees, for 30–330 degrees and the total).

To check the controllability of the SSC, we used two kinds of audio sounds, the original and the controlled ones. The original sound scene was given as the reference signal, and the listeners decided whether the controlled sound scene corresponding to their position was equal to the original sound scene when they moved. To simplify the test, only 60, 120, 180, 240, and 300 degrees were used as the listeners' positions. All listeners changed their position to each of the given angles for each test item, so the total number of trials was 200. Here, if the listeners' position was estimated incorrectly, the trial was discarded and the listeners tried again at the same position. Table 3 and Figure 8 show the result of checking the controllability of the SSC. The ratio at which the sound scene controlled by the SSC was perceived as the original sound scene was 95%. Because the SSC has the previously described problem with the ICC modification, the audio sound newly generated by the SSC exhibited a different sound scene in some trials.

5. Conclusion

Table 3: Controllability result of the proposed system using the image sensor network (columns: position in degrees, same sound scene, different sound scene, and accuracy in %).

Figure 8: Controllability result of the proposed system using the image sensor network (accuracy in % versus position in degrees).

In this paper, we proposed an automatic sound scene control system that uses an image sensor network to preserve a constant sound scene regardless of the users' movement. In the proposed system, the image sensor network detects the human location in the multichannel playback environment, and the SSC module automatically controls the sound scene of the multichannel audio signals according to the estimated human location, which is given as angle information.

To estimate the direction of the human face, we used the normalized RGB and the HSV values calculated from the images obtained by the image sensor network. The direction of the human face is simply decided as the location of the image sensor that captures the image with the highest number of pixels satisfying the normalized RGB and HSV thresholds. The estimated direction of the human face is directly fed to the SSC module, and the controlled sound scene can then be generated.

Experimental results show that the image sensor network can successfully detect the human location with an accuracy of about 98%. Moreover, the sound scene controlled by the SSC according to the detected human location was perceived as the original sound scene with an accuracy of 95%. To enhance the performance of the image sensor network and the SSC, more precise human localization using eye detection in the images remains as future work.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This study was funded by the research fund of Korea Nazarene University in 2014 (Kwangki Kim) and supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science, and Technology (Grant no. 2012R1A1A4A01004195).

References

[1] C. Faller and F. Baumgarte, "Binaural cue coding—part II: schemes and applications," IEEE Transactions on Speech and Audio Processing, vol. 11, no. 6, pp. 520–531, 2003.

[2] F. Baumgarte and C. Faller, "Binaural cue coding—part I: psychoacoustic fundamentals and design principles," IEEE Transactions on Speech and Audio Processing, vol. 11, no. 6, pp. 509–519, 2003.

[3] J. Herre, H. Purnhagen, and J. Breebaart, "The reference model architecture for MPEG spatial audio coding," in Proceedings of the 118th AES Convention, Barcelona, Spain, 2005.

[4] ISO/IEC 23003-1, "Information Technology—MPEG Audio Technologies—Part 1: MPEG Surround," 2007.

[5] H.-G. Moon, J.-I. Seo, S. Baek, and K.-M. Sung, "A multichannel audio compression method with virtual source location information for MPEG-4 SAC," IEEE Transactions on Consumer Electronics, vol. 51, no. 4, pp. 1253–1259, 2005.

[6] S. Beack, J. Seo, H. Moon, K. Kang, and M. Hahn, "Angle-based virtual source location representation for spatial audio coding," ETRI Journal, vol. 28, no. 2, pp. 219–222, 2006.

[7] D. A. Burgess, "Techniques for Low Cost Spatial Audio," UIST, 1992.

[8] S. H. Foster, E. M. Wenzel, and R. M. Taylor, Real-Time Synthesis of Complex Acoustic Environments, Crystal River Engineering, Groveland, Calif, USA.

[9] J. Blauert, Spatial Hearing: The Psychophysics of Human Sound Localization, MIT Press, Cambridge, Mass, USA, 1983.

[10] K. Kim, "Sound scene control of multi-channel audio signals for realistic audio service in wired/wireless network," International Journal of Multimedia and Ubiquitous Engineering, vol. 9, no. 2, 2014.

[11] E. Zwicker and H. Fastl, Psychoacoustics, Springer, Berlin, Germany, 1999.

[12] Y. Wang and B. Yuan, "A novel approach for human face detection from color images under complex background," Pattern Recognition, vol. 34, no. 10, pp. 1983–1992, 2001.

[13] S.-H. Kim and H.-G. Kim, "Facial region detection using range color information," IEICE Transactions on Information and Systems, vol. 81, no. 9, pp. 968–975, 1998.

[14] J. Yang and A. Waibel, "Tracking human faces in real time," CMU-CS-95-210, 1995.

[15] V. Pulkki, "Virtual sound source positioning using vector base amplitude panning," AES: Journal of the Audio Engineering Society, vol. 45, no. 6, pp. 456–465, 1997.

[16] M. A. Gerzon, "Panpot laws for multispeaker stereo," in Proceedings of the 92nd Convention of the AES, 1992.

[17] ISO/IEC JTC1/SC29/WG11 (MPEG), "Procedures for the Evaluation of Spatial Audio Coding Systems," Document N6691, Redmond, Wash, USA, 2004.
