Table 2.19: µ-Law Encoding Table
Input Range | Number of Intervals in Range | Spacing of Intervals | Left Four Bits of Compressed Code | Right Four Bits of Compressed Code
leading one), and to record the base-2 (binary) exponent. This is how floating-point numbers are encoded. Let's look at the previous example. The number 360 is encoded in 16-bit binary as

0000 0001 0110 1000

with spaces placed every four digits for readability. A-law only uses the top 13 bits. Thus, as this number is unsigned, the top 13 bits hold the value 45, which can be represented in floating point as

1.01101 (binary) × 2^5

The first four significant digits (ignoring the leading 1, which must be there for us to write the number in binary scientific notation, or floating point) are “0110”, and the exponent is 5. A-law then records the number as

0001 0110

where the first bit is the sign (0), the next three are the exponent minus four (001), and the last four are the significant digits.
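That recording step is small enough to sketch in code. The following Python sketch illustrates only the scheme just described (sign bit, exponent minus four, four significand digits); it is not a full ITU G.711 A-law transcoder, which also treats the smallest values in a linear segment and inverts alternate bits for transmission, and the function name and clamping here are our own.

    # Illustrative only: compress a magnitude the way the text describes.
    def alaw_compress(sample):
        sign = 0 if sample >= 0 else 1
        mag = min(abs(sample), 0xFFF)          # A-law input: sign + 12 magnitude bits
        if mag < 32:                           # smallest segment is (roughly) linear
            exponent, digits = 0, mag >> 1
        else:
            exponent = mag.bit_length() - 5    # (position of the leading 1) minus 4
            digits = (mag >> exponent) & 0x0F  # the four bits after the leading 1
        return (sign << 7) | (exponent << 4) | digits

    # The example from the text: the top 13 bits of the 16-bit value 360
    # hold 45 = 1.01101 (binary) x 2^5, which encodes as 0001 0110.
    print(format(alaw_compress(360 >> 3), '08b'))   # prints 00010110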
A-law is used in Europe, on their telephone systems. For voice over IP, either will usually work, and most devices speak both, no matter where they are sold. The distinctions are now mostly historical.
G.711 compression preserves the number of samples, and keeps each sample independent of the others. Therefore, it is easy to figure out how the samples can be packaged into packets or blocks. They can be cut arbitrarily, and a byte is a sample. This allows the codec to be quite flexible for voice mobility, and it should be a preferred option.
Error concealment, or packet loss concealment (PLC), is the means by which a codec can recover from packet loss, by faking the sound at the receiver until the stream catches up. G.711 has an extension for this, known as G.711I, or G.711 Appendix I. The most trivial error concealment technique is to just play silence. This does not really conceal the error. An additional technique is to repeat the last valid sample set—usually, a 10ms or 20ms packet's worth—until the stream catches up. The problem is that, should the last sample have had a plosive—any of the consonants that have a stop to them, like a p, d, t, k, and so on—the plosive will be repeated, providing an effect reminiscent of a quickly skipping record player or a 1980s science-fiction television character.* Appendix I states that, to avoid this effect, the previous samples should be tested for the fundamental wavelength, and then blocks of those wavelengths should be cross-faded together to produce a more seamless recovery. This is a purely heuristic scheme for error recovery, and competes, to some extent, with just repeating the last segment and then going silent.
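The naive repeat-until-caught-up scheme is easy to state in code. A minimal Python sketch (the names are ours), with an attenuation factor so that a trapped plosive fades out instead of skipping forever:

    # Naive packet-loss concealment: replay the last good frame, fading it
    # on each consecutive loss. G.711 Appendix I instead estimates the pitch
    # period and cross-fades whole pitch cycles for a smoother recovery.
    def conceal(last_good_frame, consecutive_losses, decay=0.5):
        gain = decay ** consecutive_losses
        return [int(sample * gain) for sample in last_good_frame]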
In many cases, G.711 is not even mentioned when it is being used. Instead, the codec may be referred to as PCM with µ-law or A-law encoding.
2.3.1.2 G.729 and Perceptual Compression
ITU G.729 and the related G.729a specify a more advanced encoding scheme, which does not work sample by sample. Rather, it uses mathematical rules to try to relate neighboring samples together. The incoming sample stream is divided into 10ms blocks (with 5ms from the next block also required), and each block is then analyzed as a unit. G.729 provides a 16:1 compression ratio, as the incoming 80 samples of 16 bits each (160 bytes) are brought down to a ten-byte encoded block.
The concept behind G.729 compression is to use perceptual compression to classify the type of signal within the 10ms block. The idea here is to try to figure out how neighboring samples relate. Surely, they do relate, because they come from the same voice and the same pitch, and pitch is a concept that requires time (thus, more than one sample). G.729 uses a couple of techniques to try to figure out what the sample must “sound like,” so it can then throw away much of the sample and transmit only the description of the sound.
To figure out what the sample block sounds like, G.729 uses Code-Excited Linear Prediction (CELP). The idea is that the encoder and decoder have a codebook of the basics of sounds. Each entry in the codebook can be used to generate some type of sound. G.729 maintains two codebooks: one fixed, and one that adapts with the signal. The model behind CELP is that the human voice is basically created by a simple set of flat vocal cords, which excite the airways. The airways—the mouth, tongue, and so on—are then thought of as signal filters, which have a rather specific, predictable effect on the sound coming up from the throat.

* Max Headroom played for one season on ABC during 1987 and 1988. The title character, an artificial intelligence, was famous for stuttering electronically.
The signal is first brought in and linear prediction is used. Linear prediction tries to relate the samples in the block to the previous samples, and finds the optimal mapping. (“Optimal” does not always mean “good,” as there is almost always an optimal way to approximate a function using a fixed number of parameters, even if the approximation is dead wrong. Recall Figure 2.9.) The excitation provides a representation of the overall type of sound, a hum or a hiss, depending on the word being said. This is usually a simple sound, an “uhhh” or “ahhh.” The linear predictor figures out how the humming gets shaped, as a simple filter. What's left over, then, is how the sound started in the first place, the excitation that makes up the more complicated, nuanced part of speech. The linear prediction's effects are removed, and the remaining signal is the residue, which must relate to the excitations. The nuances are looked up in the codebook, which contains some common residues and some others that are adaptive. Together, the information needed for the linear prediction and the codebook matches is packaged into the ten-byte output block, and the encoding is complete. The encoded block contains information on the pitch of the sound, the adaptive and fixed codebook entries that best match the excitation for the block, and the linear prediction match.
On the other side, the decoding process looks up the codebooks for the excitations. These excitations get filtered through the linear predictor. The hope is that the results sound like human speech. And, often, they do. However, anyone who has used a cellphone is aware that, at times, it can render human speech into a facsimile that sounds quite like the person talking, but is made up of no recognizable syllables. That results from a CELP decoder struggling with a lossy channel, where some of the information is missing, and it is forced to fill in the blanks.
G.729a is an annex to G.729, or a modification, that uses a simpler structure to encode the signal. It is compatible with G.729, and so can be thought of as interchangeable for the purposes of this discussion.
2.3.1.3 Other Codecs
There are other voice codecs that are beginning to appear in the context of voice mobility. These codecs are not as prevalent as G.711 and G.729—some are not available in more than softphones and open-source implementations—but they are worth a paragraph on the subject. These newer coders are focused on improving the error concealment, having better delay or jitter tolerances, or having a richer sound. One such example is the Internet Low Bitrate Codec (iLBC), which is used in a number of consumer-based peer-to-peer voice applications such as Skype.

Because the overhead of packets on most voice mobility networks is rather high, finding the highest amount of compression should not be the aim when establishing the network. Instead, it is better to find the codecs which are supported by the equipment and provide the highest quality of voice over the expected conditions of the network. For example, G.711 is fine in many conditions, and G.729 might not be necessary. Chapter 3 goes into some of the factors that can influence this.
2.3.2 RTP
The codec defines only how the voice is compressed and packaged. The voice still needs to be placed into well-defined packets and sent over the network.

The Real-time Transport Protocol (RTP), defined in RFC 3550, defines how voice is packetized on most IP-based networks. RTP is a general-purpose framework for sending real-time streaming traffic across networks, and is used for nearly all media streaming, including voice and video, where real-time delivery is essential.
RTP is usually sent over UDP, on any port that the applications negotiate. The typical RTP packet has the structure given in Table 2.20.
Table 2.20: RTP Format
Flags (2 bytes) | Sequence Number (2 bytes) | Timestamp (4 bytes) | SSRC (4 bytes) | CSRCs (4 bytes × number of contributors) | Header Extension (variable) | Payload (variable)
The idea behind RTP is that the sender sends the timestamp that the first byte of data in the payload belongs to. This timestamp gives a precise time that the receiver can use to reassemble incoming data. The sequence number also increases monotonically, and can also establish the order of incoming data. The SSRC, for Synchronization Source, is the stream identifier of the sender, and lets devices with multiple streams coming in figure out who is sending. The CSRCs, for Contributing Sources, are other devices that may have contributed to the packet, such as when a conference call has multiple talkers at once.
The most important fields are the timestamp (see Table 2.20) and the payload type (see Table 2.21). The payload type field usually specifies the type of codec being used in the stream.
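Because the fixed header is fully specified by RFC 3550, a short parser makes these fields concrete. A Python sketch (the field names are ours):

    import struct

    def parse_rtp(packet):
        """Parse the 12-byte fixed RTP header plus the CSRC list."""
        flags, seq, timestamp, ssrc = struct.unpack('!HHII', packet[:12])
        header = {
            'version':      flags >> 14,         # always 2 for RTP
            'padding':      (flags >> 13) & 0x1,
            'extension':    (flags >> 12) & 0x1,
            'cc':           (flags >> 8) & 0xF,  # contributor count
            'marker':       (flags >> 7) & 0x1,
            'payload_type': flags & 0x7F,        # e.g., 0 = PCMU, 18 = G729
            'sequence':     seq,
            'timestamp':    timestamp,
            'ssrc':         ssrc,
        }
        n = header['cc']
        header['csrcs'] = struct.unpack('!%dI' % n, packet[12:12 + 4 * n])
        return header, packet[12 + 4 * n:]       # header, payload (extension ignored)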
Table 2.22 shows the most common voice RTP types. Numbers 96 and greater are allowed, and are usually set up by the endpoints to carry some dynamic stream.
When the codec's output is packaged into RTP, it is done so as both to avoid splitting necessary information and to avoid causing too many packets per second to be sent. For G.711, an RTP packet can be created with as many samples as desired for the given packet rate. Common values are 20ms and 30ms. Decoders know to append the samples across packets as if they were in one stream. For G.729, the RTP packet must come in 10ms multiples, because G.729 only encodes 10ms blocks. An RTP packet with G.729 can have multiple blocks, and the decoder knows to treat each block separately and sequentially. G.729 phones commonly stream with RTP packets holding 20ms or more, to avoid having too many packets in the network.
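The packaging arithmetic is simple enough to write down directly. A sketch, assuming the 8000-sample-per-second telephone rate that both codecs use:

    # Payload sizes per RTP packet at 8000 samples per second.
    def g711_payload_bytes(ms):
        return 8000 * ms // 1000          # one byte per sample: 20ms -> 160 bytes

    def g729_payload_bytes(ms):
        assert ms % 10 == 0, "G.729 encodes only whole 10ms blocks"
        return 10 * (ms // 10)            # ten bytes per block: 20ms -> 20 bytes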
2.3.2.1 Secure RTP
RTP itself has a security option, designed to allow the contents of the RTP stream to be protected while still allowing the quick reassembly of a stream and the robustness of allowing parts of the stream to be lost on the network. Secure RTP (SRTP) uses the Advanced Encryption Standard (AES) to encrypt the packets. (AES will later have a starring role in Wi-Fi encryption, as well as for use with IPsec.) The RTP stream requires a key to be established. Each packet is then encrypted with AES running in counter mode, a mode where intervening packets can be lost without disrupting the decryptability of subsequent packets in the sequence. Integrity of the packets is ensured by the use of the HMAC-SHA1 keyed signature for each packet.
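The shape of this protection is easy to sketch in Python, using the third-party cryptography package. This is a toy, not interoperable SRTP: RFC 3711 derives the session keys and a per-packet counter IV from the master key, salt, SSRC, and packet index, whereas this sketch simply takes the counter IV as given.

    import hashlib
    import hmac
    from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

    def protect(payload, enc_key, auth_key, counter_iv):
        """AES counter mode plus a truncated HMAC-SHA1 tag (the "_32" in
        AES_CM_128_HMAC_SHA1_32, seen later in SDP, means a 32-bit tag)."""
        enc = Cipher(algorithms.AES(enc_key), modes.CTR(counter_iv)).encryptor()
        ciphertext = enc.update(payload) + enc.finalize()
        tag = hmac.new(auth_key, ciphertext, hashlib.sha1).digest()[:4]
        return ciphertext + tag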
How the SRTP stream gets its keys is not specified by SRTP. However, SIPS provides a way for this to be set up that is quite logical. The next section will discuss how this key exchange works.
Table 2.22: Common RTP Packet Types
0: PCMU (G.711 with µ-law) | 3: GSM | 8: PCMA (G.711 with A-law) | 18: G729

Table 2.21: The RTP Flags Field
Version (V, 2 bits) | Padding (P, 1 bit) | Extension (X, 1 bit) | Contributor Count (CC, 4 bits) | Marker (M, 1 bit) | Payload Type (PT, 7 bits)
2.3.3 SDP and Codec Negotiations

RTP only carries the voice, and there must be some associated way to signal the codecs which are supported by each end. This is fundamentally a property of signaling, but, unlike call progress messages and advanced PBX features, it is tied specifically to the bearer channel. SIP (see Section 2.2.1) uses SDP to negotiate codecs and RTP endpoints, including transports, port numbers, and every other aspect necessary to start RTP streams flowing.
SDP, defined in RFC 4566, is a text-based protocol, as SIP itself is, for setting up the various legs of media streams. Each line represents a different piece of information, in the format of type=value.
Table 2.23: Example of an SDP Description
v=0
o=7010 1352822030 1434897705 IN IP4 192.168.0.10
s=A_conversation
c=IN IP4 192.168.0.10
t=0 0
m=audio 9000 RTP/AVP 0 8 18
a=rtpmap:0 PCMU/8000/1
a=rtpmap:8 PCMA/8000/1
a=rtpmap:18 G729/8000/1
a=ptime:20
Table 2.23 shows an example of an SDP description. This description is for a phone at IP address 192.168.0.10, which wishes to receive RTP on UDP port 9000. Let's go through each of the fields.
• Type “v” represents the protocol version, which is 0.
• Type “o” holds information about the originator of this request, and the session IDs. Specifically, it is divided up into the username, session ID, session version, network type, address type, and address. “7010” happens to be the dialing phone number. The two large numbers afterward are identifiers, to keep the SDP exchanges straight. The “IN” refers to the address being an Internet protocol address; specifically, “IP4” for IPv4, of “192.168.0.10”. This is where the originator is.
• Type “s” is the session name. The value given here, “A_conversation”, is not particularly meaningful.
• Type “c” specifies how the originator must be reached—its connection data. This is a repetition of the IP address and type specifications for the phone.
• Type “t” is the timing for the leg of the call. The first “0” represents the start time, and the second represents the end time. Therefore, there are no particular timing bounds for this call.
• The “m” line specifies the media needed. In this case, as with most voice calls, there is only one voice stream from the device, so there is only one media line. The next parameters are the media type, port, application, and then the list of RTP types, for RTP. This call is an “audio” call, and the phone will be listening on port 9000. This is a UDP port, because the application is “RTP/AVP”, meaning that it is plain RTP. (“AVP” means that this is standard UDP with no encryption. There is an “RTP/SAVP” option, mentioned shortly.) Finally, the RTP formats the phone can take are 0, 8, and 18, as specified in Table 2.22.
• The next three lines describe, in detail, the codecs that are supported. The “a” field specifies an attribute. The “a=rtpmap” attribute means that the sender wants to map RTP packet types to specific codec setups. The line is formatted as packet type, encoded name/bitrate/parameters. In the first line, RTP packet type “0” is mapped to “PCMU” at 8000 samples per second. The default mapping of “0” is already PCM (G.711) with µ-law, so the new information is the sample rate. The second line asks for A-law, mapping it to 8. The third line asks for G.729, asking for 18 as the mapping. Because the phone only listed those three types, those are the only types it supports.
• The last line is also an attribute. “a=ptime” is requesting that the other party send 20ms packets. The other party is not required to submit to this request, as it is only a suggestion. However, this is a pretty good sign that the sender of the SDP message will also send at 20ms.
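Since every SDP line is a type=value pair, pulling the codec map out of a description like Table 2.23 takes only a few lines. A Python sketch (the function name is ours):

    def parse_sdp(text):
        """Split SDP into (type, value) pairs and collect rtpmap attributes."""
        fields, codecs = [], {}
        for line in text.strip().splitlines():
            key, _, value = line.partition('=')
            fields.append((key, value))
            if key == 'a' and value.startswith('rtpmap:'):
                pt, codec = value[len('rtpmap:'):].split(' ', 1)
                codecs[int(pt)] = codec
        return fields, codecs

    # Run against Table 2.23, `codecs` comes back as:
    # {0: 'PCMU/8000/1', 8: 'PCMA/8000/1', 18: 'G729/8000/1'}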
The setup message in Table 2.23 was originally given in a SIP INVITE message. The responding SIP OK message from the other party gave its SDP settings.
Table 2.24 shows this example response. Here, the other party, at IP address 10.0.0.10, wants to receive on UDP port 11690 an RTP stream with the three codecs PCMU, GSM, and PCMA. It can also receive a format known as “telephone-event.” This corresponds to the RTP payload format for sending digits while in the middle of a call (RFC 4733). Some codecs, like G.729, can't carry a dialed digit as the usual audio beep, because the beep gets distorted by the codec. Instead, the digits have to be sent over RTP, embedded in the stream. The sender of this SDP is stating that they support it, and would like it to be sent as RTP type 101, a dynamic type that the sender was allowed to choose without restriction.
Corresponding to this is the attribute “a=fmtp”, which applies to this digit type, 101. “fmtp” lines don't mean anything specific to SDP; instead, the request of “0-16” gets forwarded to the telephone event protocol handler. It is not necessary to go into further details here on what “0-16” means. The “a=silenceSupp” line would activate silence suppression, in which packets are not sent when the caller is not talking. Silence suppression has been disabled, however. Finally, the “a=sendrecv” line means that the originator can both send and receive streaming packets, meaning that the caller can both talk and listen. Some calls are intentionally one-way, such as lines into a voice conference where the listeners cannot speak. In that case, the listeners may have requested a flow with “a=recvonly”.
After a device gets an SDP request, it knows enough information to send an RTP stream back to the requester. The receiver need only choose which media type it wishes to use. There is no requirement that both parties use the same codec; rather, if the receiver cannot handle the codec, the higher-layer signaling protocol needs to reject the setup. With SIP, the called party will not usually stream until it accepts the SIP INVITE, but there is no further handshaking necessary once the call is answered and there are packets to send.
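A common receiver policy, though not one the standards mandate, is simply to take the first offered payload type it can decode. A sketch:

    def pick_codec(offered, supported):
        """Return the first offered RTP payload type we can decode."""
        for pt in offered:
            if pt in supported:
                return pt
        return None   # nothing usable: the signaling layer should reject the setup

    print(pick_codec([0, 8, 18], {8, 18}))   # prints 8 (PCMA)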
For SRTP usage with SIPS, SDP allows for the SRTP key to be specified using a special header:
a=crypto:1 AES_CM_128_HMAC_SHA1_32 inline:c3bFaGA+Seagd117041az3g113geaG54aKgd50Gz
This specifies that SRTP's AES counter mode with HMAC-SHA1 is to be used, and specifies the key, encoded in base-64, that is to be used. Both sides of the call send their own randomly generated keys, under the cover of the TLS-protected link. This forms the basis of RTP/SAVP.
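The inline value is ordinary base-64. For AES_CM_128_HMAC_SHA1_32, it decodes to 30 bytes: the 16-byte AES master key concatenated with a 14-byte master salt, per RFC 4568. A quick check in Python:

    import base64

    blob = base64.b64decode('c3bFaGA+Seagd117041az3g113geaG54aKgd50Gz')
    master_key, master_salt = blob[:16], blob[16:]   # 16-byte key, 14-byte salt
    assert len(blob) == 30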
Table 2.24: Example of an SDP Responding Description
v=0
o=root 10871 10871 IN IP4 10.0.0.10
s=session
c=IN IP4 10.0.0.10
t=0 0
m=audio 11690 RTP/AVP 0 3 8 101
a=rtpmap:0 PCMU/8000
a=rtpmap:3 GSM/8000
a=rtpmap:8 PCMA/8000
a=rtpmap:101 telephone-event/8000
a=fmtp:101 0-16
a=silenceSupp:off
a=ptime:20
a=sendrecv
Elements of Voice Quality
3.0 Introduction
This chapter examines the factors that go into voice quality. First, we will look at how voice quality was originally measured, and introduce the necessary measurement metrics (MOS and R-Value). After that, we will move on to describing the basis for repeatable, objective metrics, by way of a variety of models that start out taking actual voice samples into account, but then turn into guidelines and formulas about loss and delay that can be used to predict network quality. Keep in mind that the point of the chapter is not to substitute for thousand-page telephony guidelines, but to introduce the reader to the basics of what it takes, and to argue that perhaps—with mobility—the exactitude typically expected in static voice deployments is not that useful.
3.1 What Voice Quality Really Means
Chapter 2 laid the groundwork for how voice is carried. But what makes some phone calls sound better than others? Why do some voice mobility networks sound tinny and robotic, where others sound natural and clear? There are two ways to look at voice quality: gather a number of people and survey them about the quality of the call, or try to use some sort of electronic measurement and deduce, from there, what a real person might think.
3.1.1 Mean Opinion Score and How It Sounds
The Mean Opinion Score, or MOS (sometimes redundantly called the MOS score), is one way of ranking the quality of a phone call. This score is set on a five-point scale, according to the following ranking:
5 Excellent
4 Good
3 Fair
2 Poor
1 Bad
MOS never goes below 1, or above 5.
There is quite a science to establishing how to measure MOS based on real-world human studies, and the depth they go into is astounding. ITU P.800 lays out procedures for measuring MOS. Annex B of P.800 defines listening tests to determine quality in an absolute manner. The test requirements are spelled out in detail. The room to be used should be between 30 and 120 cubic meters, to ensure the echo remains within known values. The phone under test is used to record a series of phrases. The listeners are brought in, having been selected from a group that has never heard the recorded sentence lists, in order to avoid bias. The listeners are asked to mark the quality of the played-back speech, distorted as it may be by the phone system. The listeners' scores, on the one-to-five scale, are averaged, and this becomes the MOS for the system. The goal of all of this is to attempt to increase the repeatability of such experiments.
Clearly, performing MOS tests is not something that one would imagine can be done for most voice mobility networks. However, the MOS scale is so well known that the 1-to-5 scale is used as the standard yardstick for all voice quality metrics. The most important rule of thumb for the MOS scale is this: a MOS of 4.0 or better is toll-quality. This is the quality that voice mobility networks have to achieve, because this is the quality that nonmobility voice networks provide every day. Forgiveness will likely be offered by users when the problem is well known and entirely relatable, such as for bad-quality calls when in a poor cellular coverage area. But, once inside the building, enterprise voice mobility users expect the same quality wirelessly as they do when using their desk phones.
Thus, when a device reports the MOS for a call, the number you are seeing has been generated electronically, based on formulas that are thought to be reasonable facsimiles of the human experience.
3.1.2 PESQ: How to Predict MOS Using Mathematics
Therefore, we turn to how the predictions of voice quality can actually be made electronically. ITU P.862 introduces Perceptual Evaluation of Speech Quality, the PESQ metric. PESQ is designed to take into account all aspects of voice quality, from the distortion of the codecs themselves to the effects of filtering, delay variation, and dropouts or strange distortions. PESQ was verified with a number of real MOS experiments to make sure that the numbers are reasonable within the range of normal telephone voices.
PESQ is measured on a 1-to-4.5 scale, aligning exactly with the 1-to-5 MOS scale, in the sense that a 1 is a 1, a 2 is a 2, and so on. (The area from 4.5 to 5 in PESQ is not addressed.) PESQ is designed to take into account many different factors that alter the perception of the quality of voice.
The basic concept of PESQ is to have a piece of software or test measurement equipment compare two versions of a recording: the original one and the one distorted by the telephone