From Speech Transmission to Acoustic Telepresence

Part of the document: Wireless Communications, Second Edition, Andreas F. Molisch © 2011 John Wiley & Sons (pages 396–400)


15.5 From Speech Transmission to Acoustic Telepresence

The ultimate goal of speech transmission or telephony has always been to recreate the acoustic presence of a human talker at a geographically distant location. In that sense, telephony is just a special case of telepresence, where we attempt to augment our local reality with the virtual presence of one or more remote persons, preferably recreating the remote environment to some extent. Our short discussion starts out from simple add-on functionalities for speech transmission and progresses to the most advanced services of three-dimensional virtual audio.

15.5.1 Voice Activity Detection

In a symmetric conversation between two persons, each of the participants is silent for about 50% of the time. This fact has long been observed and exploited for multiplexing telephone conversations over the same transmission channel in a technique known as Digital Speech Interpolation (DSI). For wireless communications, multiple access to the same shared radio spectrum requires the reduction of interference created among the users. This can be achieved by transmitting speech frames over the air interface only when the talker is actively speaking, a state which is detected by a Voice Activity Detector (VAD). For an inactive speaker, Discontinuous Transmission (DTX) results in:

1. less power consumption on the mobile terminal (saving of baseband-processing and radio transmitter power), thereby extending battery life;

2. less multiuser interference on the air interface, thereby enhancing mobile access network performance;

3. less network load if packet switching is used, thereby increasing backbone network capacity.

In a typical realization, speech transmission is cut after seven nonactive frames, and a Silence Descriptor (SID) frame (only 35 bits per 20 ms) is sent as a model of the acoustic background noise, which the receiver uses for Comfort Noise Generation (CNG). This noise model is updated at least every 24 frames and can be seen as a first step toward virtual reality rendering of the auditory scene: it maintains coherence between the communication partners by implicitly letting the other party know that the call is being made while driving a car, etc.
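The hangover and SID update logic described above can be sketched as a small per-frame state machine. The frame period (20 ms), the hangover of seven frames, and the SID refresh interval of 24 frames follow the figures quoted in the text; the function names and the decision to treat pre-speech silence like a hangover period are illustrative assumptions, and the per-frame VAD decision is taken as a given input rather than computed.

```python
HANGOVER_FRAMES = 7      # speech frames still sent after activity stops
SID_UPDATE_FRAMES = 24   # SID refresh interval during silence

def dtx_schedule(vad_flags):
    """Map per-frame VAD decisions (True = active) to transmit actions.

    Returns one of 'SPEECH', 'SID', or 'NO_TX' per 20-ms frame.
    """
    actions = []
    inactive = 0          # consecutive inactive frames seen so far
    since_sid = 0         # NO_TX frames since the last SID update
    for active in vad_flags:
        if active:
            inactive = 0
            actions.append('SPEECH')
        elif (inactive := inactive + 1) <= HANGOVER_FRAMES:
            actions.append('SPEECH')                 # hangover period
        elif inactive == HANGOVER_FRAMES + 1 or since_sid >= SID_UPDATE_FRAMES:
            actions.append('SID')                    # send/refresh noise model
            since_sid = 0
        else:
            actions.append('NO_TX')                  # receiver plays comfort noise
            since_sid += 1
    return actions
```

For example, two active frames followed by ten inactive ones yield nine `SPEECH` frames (two active plus the seven-frame hangover), one `SID` frame, and then no transmission.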

While this DTX method is sometimes referred to as source-controlled rate adaptation, it operates at a very high source description level (i.e., only source on versus source off) and, e.g., is not able to rapidly vary the source rate according to the phonetic contents (e.g., voiced versus unvoiced speech). This is also not the same as AMR coding described above (with VAD, the source rate is either zero or the currently allowed maximum source rate, the latter still being adapted according to the channel state).

15.5.2 Receiver End Enhancements

If an entire speech frame is lost due to severe fading on the air interface, the resulting error cannot be corrected with channel codes targeted for individual or bursty bit errors. Rather, error concealment methods are required to interpolate such lost frames from received neighboring frames. This usually works very well for substitution of the first lost frame but may lead to unnatural sounding speech if continued over several frames lost in a row. In the latter case, later substituted frames will receive a gradual damping to achieve slow fadeout of the signal amplitude (typically over six frames = 120 ms).
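The muting schedule during consecutive losses can be sketched as follows: the first substituted frame repeats the last good frame at full level, and later substitutions are progressively attenuated until the output is muted after six frames (6 × 20 ms = 120 ms). The linear gain ramp is an illustrative assumption; standardized concealment procedures use codec-specific attenuation rules and parameter extrapolation rather than plain repetition.

```python
FADEOUT_FRAMES = 6   # frames over which the substituted signal fades to silence

def conceal(last_good_frame, n_lost):
    """Return the substitution for the n-th consecutive lost frame (1-based)."""
    gain = max(0.0, 1.0 - (n_lost - 1) / FADEOUT_FRAMES)
    return [gain * s for s in last_good_frame]
```

The first substitution (`n_lost=1`) reproduces the last good frame unchanged; by the seventh consecutive loss the gain has reached zero and the output is fully muted.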

A further receiver end signal enhancement is adaptive postfiltering, which combines filters shaped according to the spectral envelope with LTP filters to make the noise introduced by lossy coding less audible to the listener. Adaptive playout buffers or jitter buffers help to conceal lost frames on packet-switched networks. If a terminal is equipped with wideband speech Input/Output (I/O) functionality (analog bandwidth from 50 Hz up to 7 kHz), speech received over a narrowband network may be augmented by artificially created frequency bands above 3.4 kHz and below 300 Hz using a synthesis mechanism known as bandwidth extension.
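The reordering and loss-detection core of a jitter buffer can be sketched as below. Frames arriving out of order over a packet-switched network are reordered by sequence number, and a frame is declared lost (to be handled by concealment) once its playout slot arrives and it is still missing. This is a minimal sketch: a real adaptive playout buffer would additionally delay playout by a target depth that tracks measured network jitter, which is omitted here.

```python
import heapq

class JitterBuffer:
    """Reorders out-of-order speech frames by sequence number."""

    def __init__(self):
        self.heap = []        # min-heap of (sequence_number, frame)
        self.next_seq = 0     # next sequence number due for playout

    def push(self, seq, frame):
        if seq >= self.next_seq:              # discard frames arriving too late
            heapq.heappush(self.heap, (seq, frame))

    def pop(self):
        """Return the frame due for playout, or None if it must be concealed."""
        due = self.next_seq
        self.next_seq += 1
        while self.heap and self.heap[0][0] < due:
            heapq.heappop(self.heap)          # drop stale entries
        if self.heap and self.heap[0][0] == due:
            return heapq.heappop(self.heap)[1]
        return None
```

Pushing frames 0, 2, 1 in that order and popping three times returns them in sequence; a fourth pop returns `None`, signalling the error concealment path.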

15.5.3 Acoustic Echo and Noise

The acoustic environment of a speaker may contain not only useful background information that adds to the presence of the auditory scene but also noise sources that are considered annoying, as well as reverberation and (acoustic) echo arising because both speaking parties share a common acoustic space when using hands-free phones. In such situations, echo and noise need to be controlled not only for the comfort of the human users but also for the speech-coding systems which, if strongly based on a speech signal model, will fail to handle background noise or echo appropriately. Solutions that perform joint echo and noise control will often achieve the best compromise in reducing the two impairments. An uncontrolled echo can have drastic consequences on the stability of the entire communication system; strict performance requirements have been set as standards by ITU-T in recommendations G.167 and G.168.
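A common building block behind such echo control is an adaptive filter that models the loudspeaker-to-microphone echo path and subtracts its echo estimate from the microphone signal. The sketch below uses the normalized least-mean-squares (NLMS) algorithm; the filter length, step size `mu`, and regularization `eps` are illustrative assumptions, not values prescribed by G.167/G.168, and a deployed canceller would add double-talk detection and residual echo suppression.

```python
def nlms_echo_canceller(far_end, mic, taps=8, mu=0.5, eps=1e-8):
    """Return the echo-cancelled microphone (near-end) signal."""
    w = [0.0] * taps                       # echo-path estimate (FIR taps)
    x = [0.0] * taps                       # delay line of far-end samples
    out = []
    for far, m in zip(far_end, mic):
        x = [far] + x[:-1]                 # shift in newest far-end sample
        echo_hat = sum(wi * xi for wi, xi in zip(w, x))
        e = m - echo_hat                   # residual: near-end speech + model error
        norm = sum(xi * xi for xi in x) + eps
        w = [wi + mu * e * xi / norm for wi, xi in zip(w, x)]   # NLMS update
        out.append(e)
    return out
```

With a short, time-invariant echo path and no near-end speech, the residual decays toward zero as the filter converges on the true echo path.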


15.5.4 Service Augmentation for Telepresence

Speech-Enabled Services

In a telepresence framework, users will access a much wider range of services than in conventional telephony. Given the shrinking size of mobile terminals, service access benefits greatly from spoken language dialog with the machine offering the service. This can range from simple name dialling by voice (possibly activated by a spoken magic word) to Text To Speech (TTS) synthesis for email reading or Distributed Speech Recognition (DSR), where feature vectors for speech recognition are extracted on the mobile terminal, encoded with sufficient accuracy for pattern recognition (but not necessarily for signal regeneration at the central server end), and transmitted as data over the wireless link. This allows us to transmit recognition features based on higher quality signals than those used in telephony, and it also allows us to use specific source- and channel-coding mechanisms that maximize benefits for the remote speech recognition server (rather than a human listener).

While DSR carries speech information over a data channel, the dual application, Cellular Text Telephony (CTT), carries textual data over the speech channel. It provides augmentative communication means for people with hearing or speaking impairments, who can exchange interactive text (not just short messages) in a chat-like style using a modem operating over the digital speech channel.

Personalization

For personalized services, talker authentication will be a must; it should be distinguished from authentication of the mobile station or the infrastructure itself. The identity of a talker can be established via voice identification or verification techniques and, if needed, an additional personalized watermark may be inserted into the speech signal prior to encoding it.

For talker privacy, the traditional phone booth might one day be replaced by a virtual talker sphere, outside of which active speech cancellation (using wearable loudspeaker arrays) would make the phone conversation hardly audible to bystanders.

Three-Dimensional Audio

The virtual talker sphere has already introduced the concept of advanced audio processing for speech telephony, which has seen a tremendous boost from the use of microphone and loudspeaker arrays and virtual/augmented audio techniques that allow the spatial rendering (e.g., ambisonic) of three-dimensional sound fields, potentially converted to binaural headphone listening. The best effects are expected if personalized Head-Related Transfer Functions (HRTFs) can be used together with real-time tracking of head movements to place virtual sound sources in the context of the real environment, creating the immersive telepresence that blends the presence of virtual and local participants for successful teleconferencing.
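The binaural rendering step can be sketched as convolving a mono source with a pair of head-related impulse responses (HRIRs, the time-domain counterparts of the HRTFs above) to place it at a virtual position. The two-tap HRIRs in the usage example are toy values standing in for measured, personalized responses; a real system would also re-select the HRIR pair as head tracking reports listener movement, which is beyond this sketch.

```python
def convolve(signal, h):
    """Direct-form FIR convolution of a signal with impulse response h."""
    out = [0.0] * (len(signal) + len(h) - 1)
    for i, s in enumerate(signal):
        for j, hj in enumerate(h):
            out[i + j] += s * hj
    return out

def render_binaural(mono, hrir_left, hrir_right):
    """Return (left, right) ear signals for one virtual sound source."""
    return convolve(mono, hrir_left), convolve(mono, hrir_right)
```

For instance, with a unit impulse as the source, an HRIR pair of `[1.0]` (left) and `[0.0, 0.6]` (right) produces a right-ear signal that is delayed by one sample and attenuated, the interaural time and level differences that create the spatial impression.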

Ultimately, this will require a significant shift beyond today’s speech-coding standards to allow for high-quality, multichannel audio including metadata information. A first move in this direction has been made by standardization of the AMR wideband speech codec, which has become the first speech-coding standard ever to be accepted almost simultaneously for wireline communication (ITU), wireless communication (European Telecommunications Standards Institute/Third Generation Partnership Project (ETSI/3GPP)), and Internet telephony (Internet Engineering Task Force (IETF)).

Further Reading

The classical textbooks on speech coding are Jayant and Noll [1984] and Kleijn and Paliwal [1995], which, at the time of this writing, are unfortunately out of print. The excellent textbook by Vary et al. [1998] is written in German, but an improved edition in English is in preparation. Recently, a second edition of Kondoz [2004] has appeared. Another excellent overview of various aspects of speech processing is Rabiner [1994]. Source-coding theory is treated in Berger [1971], Gray [1989], and Kleijn [2005], the last of which is perhaps best suited to the needs of speech coding. For speech signal processing in general, we recommend the classic Rabiner and Schafer [1978] and the more recent textbooks of Deller et al. [2000], O’Shaughnessy [2000], and Quatieri [2002]. For extensions to nonlinear speech modeling, see Chollet et al. [2005] and Kubin [1995]. An extensive discussion of VQ is provided in Gersho and Gray [1992]. Hanzo et al. [2001] address joint solutions for source and channel coding for wireless speech transmission. The wider context of acoustic signal processing for telecommunications is treated in Gay and Benesty [2000], Haensler and Schmidt [2004], and Vaseghi [2000]. Bandwidth extension is the main topic of Larsen and Aarts [2004]. Excellent references for spoken language processing are Huang et al. [2001] and Jurafsky and Martin [2000]. Gibson et al. [1998] treat the compression of general multimedia data.

For updates and errata for this chapter, see wides.usc.edu/teaching/textbook

16

Equalizers
