Table 2.19: µ-Law Encoding Table
Input Range | Number of Intervals in Range | Spacing of Intervals | Left Four Bits of Compressed Code | Right Four Bits of Compressed Code
leading one), and to record the base-2 (binary) exponent. This is how floating-point numbers are encoded. Let's look at the previous example. The number 360 is encoded in 16-bit binary as

0000 0001 0110 1000

with spaces placed every four digits for readability. A-law only uses the top 13 bits. Thus, as this number is unsigned, the top 13 bits hold the value 45, which can be represented in floating point as

1.01101 (binary) × 2^5

The first four significant digits (ignoring the leading 1, which must be there for us to write the number in binary scientific notation, or floating point) are “0110”, and the exponent is 5. A-law then records the number as

0001 0110

where the first bit is the sign (0), the next three are the exponent minus four (001), and the last four are the significant digits.
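That recording step is small enough to sketch in code. The following Python sketch illustrates only the scheme just described (sign bit, exponent minus four, four significand digits); it is not a full ITU G.711 A-law transcoder, which also treats the smallest values in a linear segment and inverts alternate bits for transmission, and the function name and clamping here are our own.

    # Illustrative only: compress a magnitude the way the text describes.
    def alaw_compress(sample):
        sign = 0 if sample >= 0 else 1
        mag = min(abs(sample), 0xFFF)          # A-law input: sign + 12 magnitude bits
        if mag < 32:                           # smallest segment is (roughly) linear
            exponent, digits = 0, mag >> 1
        else:
            exponent = mag.bit_length() - 5    # (position of the leading 1) minus 4
            digits = (mag >> exponent) & 0x0F  # the four bits after the leading 1
        return (sign << 7) | (exponent << 4) | digits

    # The example from the text: the top 13 bits of the 16-bit value 360
    # hold 45 = 1.01101 (binary) x 2^5, which encodes as 0001 0110.
    print(format(alaw_compress(360 >> 3), '08b'))   # prints 00010110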
A-law is used in Europe, on their telephone systems. For voice over IP, either will usually work, and most devices speak both, no matter where they are sold. The distinctions are now mostly historical.
G.711 compression preserves the number of samples, and keeps each sample independent of the others. Therefore, it is easy to figure out how the samples can be packaged into packets or blocks. They can be cut arbitrarily, and a byte is a sample. This allows the codec to be quite flexible for voice mobility, and it should be a preferred option.
Error concealment, or packet loss concealment (PLC), is the means by which a codec can recover from packet loss, by faking the sound at the receiver until the stream catches up. G.711 has an extension for this, known as G.711I, or G.711 Appendix I. The most trivial error concealment technique is to just play silence. This does not really conceal the error. An additional technique is to repeat the last valid sample set—usually, a 10ms or 20ms packet's worth—until the stream catches up. The problem is that, should the last sample have had a plosive—any of the consonants that have a stop to them, like a p, d, t, k, and so on—the plosive will be repeated, providing an effect reminiscent of a quickly skipping record player or a 1980s science-fiction television character.* Appendix I states that, to avoid this effect, the previous samples should be tested for the fundamental wavelength, and then blocks of those wavelengths should be cross-faded together to produce a more seamless recovery. This is a purely heuristic scheme for error recovery, and competes, to some extent, with just repeating the last segment and then going silent.
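The naive repeat-until-caught-up scheme is easy to state in code. A minimal Python sketch (the names are ours), with an attenuation factor so that a trapped plosive fades out instead of skipping forever:

    # Naive packet-loss concealment: replay the last good frame, fading it
    # on each consecutive loss. G.711 Appendix I instead estimates the pitch
    # period and cross-fades whole pitch cycles for a smoother recovery.
    def conceal(last_good_frame, consecutive_losses, decay=0.5):
        gain = decay ** consecutive_losses
        return [int(sample * gain) for sample in last_good_frame]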
In many cases, G.711 is not even mentioned when it is being used. Instead, the codec may be referred to as PCM with µ-law or A-law encoding.
2.3.1.2 G.729 and Perceptual Compression
ITU G.729 and the related G.729a specify a more advanced encoding scheme, which does not work sample by sample. Rather, it uses mathematical rules to try to relate neighboring samples together. The incoming sample stream is divided into 10ms blocks (with 5ms from the next block also required), and each block is then analyzed as a unit. G.729 provides a 16:1 compression ratio, as the incoming 80 samples of 16 bits each (160 bytes) are brought down to a ten-byte encoded block.
The concept behind G.729 compression is to use perceptual compression to classify the type of signal within the 10ms block. The idea here is to try to figure out how neighboring samples relate. Surely, they do relate, because they come from the same voice and the same pitch, and pitch is a concept that requires time (thus, more than one sample). G.729 uses a couple of techniques to try to figure out what the sample must “sound like,” so it can then throw away much of the sample and transmit only the description of the sound.
To figure out what the sample block sounds like, G.729 uses Code-Excited Linear Prediction (CELP). The idea is that the encoder and decoder have a codebook of the basics of sounds. Each entry in the codebook can be used to generate some type of sound. G.729 maintains two codebooks: one fixed, and one that adapts with the signal. The model behind CELP is that the human voice is basically created by a simple set of flat vocal cords, which excite the airways. The airways—the mouth, tongue, and so on—are then thought of as signal filters, which have a rather specific, predictable effect on the sound coming up from the throat.

* Max Headroom played for one season on ABC during 1987 and 1988. The title character, an artificial intelligence, was famous for stuttering electronically.
The signal is first brought in and linear prediction is used. Linear prediction tries to relate the samples in the block to the previous samples, and finds the optimal mapping. (“Optimal” does not always mean “good,” as there is almost always an optimal way to approximate a function using a fixed number of parameters, even if the approximation is dead wrong. Recall Figure 2.9.) The excitation provides a representation of the overall type of sound, a hum or a hiss, depending on the word being said. This is usually a simple sound, an “uhhh” or “ahhh.” The linear predictor figures out how the humming gets shaped, as a simple filter. What's left over, then, is how the sound started in the first place, the excitation that makes up the more complicated, nuanced part of speech. The linear prediction's effects are removed, and the remaining signal is the residue, which must relate to the excitations. The nuances are looked up in the codebook, which contains some common residues and some others that are adaptive. Together, the information needed for the linear prediction and the codebook matches is packaged into the ten-byte output block, and the encoding is complete. The encoded block contains information on the pitch of the sound, the adaptive and fixed codebook entries that best match the excitation for the block, and the linear prediction match.
On the other side, the decoding process looks up the codebooks for the excitations. These excitations get filtered through the linear predictor. The hope is that the results sound like human speech. And, often, they do. However, anyone who has used a cellphone is aware that, at times, it can render human speech into a facsimile that sounds quite like the person talking, but is made up of no recognizable syllables. That results from a CELP decoder struggling with a lossy channel, where some of the information is missing, and it is forced to fill in the blanks.
G.729a is an annex to G.729, or a modification, that uses a simpler structure to encode the signal. It is compatible with G.729, and so can be thought of as interchangeable for the purposes of this discussion.
2.3.1.3 Other Codecs
There are other voice codecs that are beginning to appear in the context of voice mobility. These codecs are not as prevalent as G.711 and G.729—some are not available in more than softphones and open-source implementations—but they are worth a paragraph on the subject. These newer coders are focused on improving the error concealment, having better delay or jitter tolerances, or having a richer sound. One such example is the Internet Low Bitrate Codec (iLBC), which is used in a number of consumer-based peer-to-peer voice applications such as Skype.

Because the overhead of packets on most voice mobility networks is rather high, finding the highest amount of compression should not be the aim when establishing the network. Instead, it is better to find the codecs which are supported by the equipment and provide the highest quality of voice over the expected conditions of the network. For example, G.711 is fine in many conditions, and G.729 might not be necessary. Chapter 3 goes into some of the factors that can influence this.
2.3.2 RTP
The codec defines only how the voice is compressed and packaged. The voice still needs to be placed into well-defined packets and sent over the network.

The Real-time Transport Protocol (RTP), defined in RFC 3550, defines how voice is packetized on most IP-based networks. RTP is a general-purpose framework for sending real-time streaming traffic across networks, and is used for nearly all media streaming, including voice and video, where real-time delivery is essential.
RTP is usually sent over UDP, on any port that the applications negotiate. The typical RTP packet has the structure given in Table 2.20.
Table 2.20: RTP Format
Flags (2 bytes) | Sequence Number (2 bytes) | Timestamp (4 bytes) | SSRC (4 bytes) | CSRCs (4 bytes × number of contributors) | Header Extension (variable) | Payload (variable)
The idea behind RTP is that the sender sends the timestamp that the first byte of data in the payload belongs to. This timestamp gives a precise time that the receiver can use to reassemble incoming data. The sequence number also increases monotonically, and can also establish the order of incoming data. The SSRC, for Synchronization Source, is the stream identifier of the sender, and lets devices with multiple streams coming in figure out who is sending. The CSRCs, for Contributing Sources, are other devices that may have contributed to the packet, such as when a conference call has multiple talkers at once.
The most important fields are the timestamp (see Table 2.20) and the payload type (see Table 2.21). The payload type field usually specifies the type of codec being used in the stream.
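Because the fixed header is fully specified by RFC 3550, a short parser makes these fields concrete. A Python sketch (the field names are ours):

    import struct

    def parse_rtp(packet):
        """Parse the 12-byte fixed RTP header plus the CSRC list."""
        flags, seq, timestamp, ssrc = struct.unpack('!HHII', packet[:12])
        header = {
            'version':      flags >> 14,         # always 2 for RTP
            'padding':      (flags >> 13) & 0x1,
            'extension':    (flags >> 12) & 0x1,
            'cc':           (flags >> 8) & 0xF,  # contributor count
            'marker':       (flags >> 7) & 0x1,
            'payload_type': flags & 0x7F,        # e.g., 0 = PCMU, 18 = G729
            'sequence':     seq,
            'timestamp':    timestamp,
            'ssrc':         ssrc,
        }
        n = header['cc']
        header['csrcs'] = struct.unpack('!%dI' % n, packet[12:12 + 4 * n])
        return header, packet[12 + 4 * n:]       # header, payload (extension ignored)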
Table 2.22 shows the most common voice RTP types. Numbers 96 and greater are allowed, and are usually set up by the endpoints to carry some dynamic stream.
When the codec's output is packaged into RTP, it is done so as both to avoid splitting necessary information and to avoid causing too many packets per second to be sent. For G.711, an RTP packet can be created with as many samples as desired for the given packet rate. Common values are 20ms and 30ms. Decoders know to append the samples across packets as if they were in one stream. For G.729, the RTP packet must come in 10ms multiples, because G.729 only encodes 10ms blocks. An RTP packet with G.729 can have multiple blocks, and the decoder knows to treat each block separately and sequentially. G.729 phones commonly stream with RTP packets holding 20ms or more, to avoid having too many packets in the network.
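The packaging arithmetic is simple enough to write down directly. A sketch, assuming the 8000-sample-per-second telephone rate that both codecs use:

    # Payload sizes per RTP packet at 8000 samples per second.
    def g711_payload_bytes(ms):
        return 8000 * ms // 1000          # one byte per sample: 20ms -> 160 bytes

    def g729_payload_bytes(ms):
        assert ms % 10 == 0, "G.729 encodes only whole 10ms blocks"
        return 10 * (ms // 10)            # ten bytes per block: 20ms -> 20 bytes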
2.3.2.1 Secure RTP
RTP itself has a security option, designed to allow the contents of the RTP stream to be protected while still allowing the quick reassembly of a stream and the robustness of allowing parts of the stream to be lost on the network. Secure RTP (SRTP) uses the Advanced Encryption Standard (AES) to encrypt the packets. (AES will later have a starring role in Wi-Fi encryption, as well as for use with IPsec.) The RTP stream requires a key to be established. Each packet is then encrypted with AES running in counter mode, a mode where intervening packets can be lost without disrupting the decryptability of subsequent packets in the sequence. Integrity of the packets is ensured by the use of the HMAC-SHA1 keyed signature for each packet.
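The shape of this protection is easy to sketch in Python, using the third-party cryptography package. This is a toy, not interoperable SRTP: RFC 3711 derives the session keys and a per-packet counter IV from the master key, salt, SSRC, and packet index, whereas this sketch simply takes the counter IV as given.

    import hashlib
    import hmac
    from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

    def protect(payload, enc_key, auth_key, counter_iv):
        """AES counter mode plus a truncated HMAC-SHA1 tag (the "_32" in
        AES_CM_128_HMAC_SHA1_32, seen later in SDP, means a 32-bit tag)."""
        enc = Cipher(algorithms.AES(enc_key), modes.CTR(counter_iv)).encryptor()
        ciphertext = enc.update(payload) + enc.finalize()
        tag = hmac.new(auth_key, ciphertext, hashlib.sha1).digest()[:4]
        return ciphertext + tag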
How the SRTP stream gets its keys is not specified by SRTP. However, SIPS provides a way for this to be set up that is quite logical. The next section will discuss how this key exchange works.
Table 2.22: Common RTP Packet Types
0: PCMU (G.711 with µ-law) | 3: GSM | 8: PCMA (G.711 with A-law) | 18: G729

Table 2.21: The RTP Flags Field
Version (V, 2 bits) | Padding (P, 1 bit) | Extension (X, 1 bit) | Contributor Count (CC, 4 bits) | Marker (M, 1 bit) | Payload Type (PT, 7 bits)
2.3.3 SDP and Codec Negotiations

RTP only carries the voice, and there must be some associated way to signal the codecs which are supported by each end. This is fundamentally a property of signaling, but, unlike call progress messages and advanced PBX features, it is tied specifically to the bearer channel. SIP (see Section 2.2.1) uses SDP to negotiate codecs and RTP endpoints, including transports, port numbers, and every other aspect necessary to start RTP streams flowing.
SDP, defined in RFC 4566, is a text-based protocol, as SIP itself is, for setting up the various legs of media streams. Each line represents a different piece of information, in the format of type=value.
Table 2.23: Example of an SDP Description
v=0
o=7010 1352822030 1434897705 IN IP4 192.168.0.10
s=A_conversation
c=IN IP4 192.168.0.10
t=0 0
m=audio 9000 RTP/AVP 0 8 18
a=rtpmap:0 PCMU/8000/1
a=rtpmap:8 PCMA/8000/1
a=rtpmap:18 G729/8000/1
a=ptime:20
Table 2.23 shows an example of an SDP description. This description is for a phone at IP address 192.168.0.10, which wishes to receive RTP on UDP port 9000. Let's go through each of the fields.
• Type “v” represents the protocol version, which is 0.
• Type “o” holds information about the originator of this request, and the session IDs. Specifically, it is divided up into the username, session ID, session version, network type, address type, and address. “7010” happens to be the dialing phone number. The two large numbers afterward are identifiers, to keep the SDP exchanges straight. The “IN” refers to the address being an Internet protocol address; specifically, “IP4” for IPv4, of “192.168.0.10”. This is where the originator is.
• Type “s” is the session name. The value given here, “A_conversation”, is not particularly meaningful.
• Type “c” specifies how the originator must be reached—its connection data. This is a repetition of the IP address and type specifications for the phone.
• Type “t” is the timing for the leg of the call. The first “0” represents the start time, and the second represents the end time. Therefore, there are no particular timing bounds for this call.
• The “m” line specifies the media needed. In this case, as with most voice calls, there is only one voice stream from the device, so there is only one media line. The next parameters are the media type, port, application, and then the list of RTP types, for RTP. This call is an “audio” call, and the phone will be listening on port 9000. This is a UDP port, because the application is “RTP/AVP”, meaning that it is plain RTP. (“AVP” means that this is standard UDP with no encryption. There is an “RTP/SAVP” option, mentioned shortly.) Finally, the RTP formats the phone can take are 0, 8, and 18, as specified in Table 2.22.
• The next three lines describe, in detail, the codecs that are supported. The “a” field specifies an attribute. The “a=rtpmap” attribute means that the sender wants to map RTP packet types to specific codec setups. The line is formatted as packet type, encoded name/bitrate/parameters. In the first line, RTP packet type “0” is mapped to “PCMU” at 8000 samples per second. The default mapping of “0” is already PCM (G.711) with µ-law, so the new information is the sample rate. The second line asks for A-law, mapping it to 8. The third line asks for G.729, asking for 18 as the mapping. Because the phone only listed those three types, those are the only types it supports.
• The last line is also an attribute. “a=ptime” is requesting that the other party send 20ms packets. The other party is not required to submit to this request, as it is only a suggestion. However, this is a pretty good sign that the sender of the SDP message will also send at 20ms.
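Since every SDP line is a type=value pair, pulling the codec map out of a description like Table 2.23 takes only a few lines. A Python sketch (the function name is ours):

    def parse_sdp(text):
        """Split SDP into (type, value) pairs and collect rtpmap attributes."""
        fields, codecs = [], {}
        for line in text.strip().splitlines():
            key, _, value = line.partition('=')
            fields.append((key, value))
            if key == 'a' and value.startswith('rtpmap:'):
                pt, codec = value[len('rtpmap:'):].split(' ', 1)
                codecs[int(pt)] = codec
        return fields, codecs

    # Run against Table 2.23, `codecs` comes back as:
    # {0: 'PCMU/8000/1', 8: 'PCMA/8000/1', 18: 'G729/8000/1'}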
The setup message in Table 2.23 was originally given in a SIP INVITE message. The responding SIP OK message from the other party gave its SDP settings.
Table 2.24 shows this example response. Here, the other party, at IP address 10.0.0.10, wants to receive on UDP port 11690 an RTP stream with the three codecs PCMU, GSM, and PCMA. It can also receive a format known as “telephone-event.” This corresponds to the RTP payload format for sending digits while in the middle of a call (RFC 4733). Some codecs, like G.729, can't carry a dialed digit as the usual audio beep, because the beep gets distorted by the codec. Instead, the digits have to be sent over RTP, embedded in the stream. The sender of this SDP is stating that they support it, and would like it to be sent as RTP type 101, a dynamic type that the sender was allowed to choose without restriction.
Corresponding to this is the attribute “a=fmtp”, which applies to this digit type, 101. “fmtp” lines don't mean anything specific to SDP; instead, the request of “0-16” gets forwarded to the telephone event protocol handler. It is not necessary to go into further details here on what “0-16” means. The “a=silenceSupp” line would activate silence suppression, in which packets are not sent when the caller is not talking. Silence suppression has been disabled, however. Finally, the “a=sendrecv” line means that the originator can both send and receive streaming packets, meaning that the caller can both talk and listen. Some calls are intentionally one-way, such as lines into a voice conference where the listeners cannot speak. In that case, the listeners may have requested a flow with “a=recvonly”.
After a device gets an SDP request, it knows enough information to send an RTP stream back to the requester. The receiver need only choose which media type it wishes to use. There is no requirement that both parties use the same codec; rather, if the receiver cannot handle the codec, the higher-layer signaling protocol needs to reject the setup. With SIP, the called party will not usually stream until it accepts the SIP INVITE, but there is no further handshaking necessary once the call is answered and there are packets to send.
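A common receiver policy, though not one the standards mandate, is simply to take the first offered payload type it can decode. A sketch:

    def pick_codec(offered, supported):
        """Return the first offered RTP payload type we can decode."""
        for pt in offered:
            if pt in supported:
                return pt
        return None   # nothing usable: the signaling layer should reject the setup

    print(pick_codec([0, 8, 18], {8, 18}))   # prints 8 (PCMA)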
For SRTP usage with SIPS, SDP allows for the SRTP key to be specified using a special header:
a=crypto:1 AES_CM_128_HMAC_SHA1_32 inline:c3bFaGA+Seagd117041az3g113geaG54aKgd50Gz
This specifies that SRTP's AES counter mode with HMAC-SHA1 is to be used, and specifies the key, encoded in base-64, that is to be used. Both sides of the call send their own randomly generated keys, under the cover of the TLS-protected link. This forms the basis of RTP/SAVP.
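The inline value is ordinary base-64. For AES_CM_128_HMAC_SHA1_32, it decodes to 30 bytes: the 16-byte AES master key concatenated with a 14-byte master salt, per RFC 4568. A quick check in Python:

    import base64

    blob = base64.b64decode('c3bFaGA+Seagd117041az3g113geaG54aKgd50Gz')
    master_key, master_salt = blob[:16], blob[16:]   # 16-byte key, 14-byte salt
    assert len(blob) == 30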
Table 2.24: Example of an SDP Responding Description
v=0
o=root 10871 10871 IN IP4 10.0.0.10
s=session
c=IN IP4 10.0.0.10
t=0 0
m=audio 11690 RTP/AVP 0 3 8 101
a=rtpmap:0 PCMU/8000
a=rtpmap:3 GSM/8000
a=rtpmap:8 PCMA/8000
a=rtpmap:101 telephone-event/8000
a=fmtp:101 0-16
a=silenceSupp:off
a=ptime:20
a=sendrecv
Elements of Voice Quality
3.0 Introduction
This chapter examines the factors that go into voice quality. First, we will look at how voice quality was originally measured, and introduce the necessary measurement metrics (MOS and R-Value). After that, we will move on to describing the basis for repeatable, objective metrics, by way of a variety of models that start out taking actual voice samples into account, but then turn into guidelines and formulas about loss and delay that can be used to predict network quality. Keep in mind that the point of the chapter is not to substitute for thousand-page telephony guidelines, but to introduce the reader to the basics of what it takes, and to argue that perhaps—with mobility—the exactitude typically expected in static voice deployments is not that useful.
3.1 What Voice Quality Really Means
Chapter 2 laid the groundwork for how voice is carried. But what makes some phone calls sound better than others? Why do some voice mobility networks sound tinny and robotic, where others sound natural and clear? There are two ways to look at voice quality: gather a number of people and survey them about the quality of the call, or try to use some sort of electronic measurement and deduce, from there, what a real person might think.
3.1.1 Mean Opinion Score and How It Sounds
The Mean Opinion Score, or MOS (sometimes redundantly called the MOS score), is one way of ranking the quality of a phone call. This score is set on a five-point scale, according to the following ranking:
5 Excellent
4 Good
3 Fair
2 Poor
1 Bad
MOS never goes below 1, or above 5.
There is quite a science to establishing how to measure MOS based on real-world human studies, and the depth they go into is astounding. ITU P.800 lays out procedures for measuring MOS. Annex B of P.800 defines listening tests to determine quality in an absolute manner. The test requirements are spelled out in detail. The room to be used should be between 30 and 120 cubic meters, to ensure the echo remains within known values. The phone under test is used to record a series of phrases. The listeners are brought in, having been selected from a group that has never heard the recorded sentence lists, in order to avoid bias. The listeners are asked to mark the quality of the played-back speech, distorted as it may be by the phone system. The listeners' scores, on the one-to-five scale, are averaged, and this becomes the MOS for the system. The goal of all of this is to attempt to increase the repeatability of such experiments.
Clearly, performing MOS tests is not something that one would imagine can be done for most voice mobility networks. However, the MOS scale is so well known that the 1-to-5 scale is used as the standard yardstick for all voice quality metrics. The most important rule of thumb for the MOS scale is this: a MOS of 4.0 or better is toll-quality. This is the quality that voice mobility networks have to achieve, because this is the quality that nonmobility voice networks provide every day. Forgiveness will likely be offered by users when the problem is well known and entirely relatable, such as for bad-quality calls when in a poor cellular coverage area. But, once inside the building, enterprise voice mobility users expect the same quality wirelessly as they do when using their desk phones.
Thus, when a device reports the MOS for a call, the number you are seeing has been generated electronically, based on formulas that are thought to be reasonable facsimiles of the human experience.
3.1.2 PESQ: How to Predict MOS Using Mathematics
Therefore, we turn to how the predictions of voice quality can actually be made electronically. ITU P.862 introduces Perceptual Evaluation of Speech Quality, the PESQ metric. PESQ is designed to take into account all aspects of voice quality, from the distortion of the codecs themselves to the effects of filtering, delay variation, and dropouts or strange distortions. PESQ was verified with a number of real MOS experiments to make sure that the numbers are reasonable within the range of normal telephone voices.
PESQ is measured on a 1-to-4.5 scale, aligning exactly with the 1-to-5 MOS scale, in the sense that a 1 is a 1, a 2 is a 2, and so on. (The area from 4.5 to 5 in PESQ is not addressed.) PESQ is designed to take into account many different factors that alter the perception of the quality of voice.
The basic concept of PESQ is to have a piece of software or test measurement equipment compare two versions of a recording: the original one and the one distorted by the telephone