The resolution of a video stream is generally matched to the number of pixels in the device's display, as there is not much sense in using a video stream that has a higher resolution than the display.
Each pixel holds the color information for that point in the image. This makes a pixel a rectangular area of uniform color. Computer representations of human-visible color can be measured in many different ways—which is both a problem and an opportunity. For raw images, which have no compression, one common way to measure color is to use the RGB method. RGB relies on the fact that the human eye sees three primary colors—red, green, and blue—and that all colors are some combination of those three. For this reason, computer and television displays are made with pixels that are themselves made of the three different colors, each lit with independent intensity. The perception of color is a rich and detailed subject, but there are some concepts we should briefly address here to give a sense for what is going on. Pure light, as electromagnetic radiation, can be thought of as having one frequency. The colors of the rainbow are all colors of one pure "tone," or light of one single frequency. The eye has cones, the receptors that sense color, and these cones come in three types—not surprisingly, one for red, one for blue, and one for green. The three cones respond differently to each pure, spectral color, giving the eye good coverage over all of the colors that can exist in the world. When light of one frequency falls upon the cones of the eye, each type of cone responds differently and predictably. From this comes the perception of the color of that frequency.
Of course, not all colors are made of pure color. White is most certainly not a pure color, as it can be split by a prism into the entire rainbow spectrum. It must be more than just one frequency, then. In fact, it is a balanced mixture of many frequencies, and comes out white because it excites the three primary color cones of the eye with even intensity. This is no coincidence, as white happens to represent the total mixture of frequencies the sun radiates down on the planet. Because everything we see is reflected from normally white light, our most common experiences of color belong to the subset of white light itself.
We can think of representing color, then, by a triplet of intensities, representing the amount of activation of the three cones in the eye. This is a good start. As it happens, however, video displays are not built around the exact excitement of the three cones. The red, blue, and green are a bit different from what the eye sees, mostly because display screens need to generate light from something that could be mass produced, and it was sufficient to use a different red, green, and blue, so long as the entire range of colors could come close to running the gamut of what the eye can see. (Gamut is the correct term for the space of colors that a particular way of representing colors can cover.) This means that RGB can represent essentially all of the colors we know of and use on a regular basis, though often as only an approximation to the real thing. There are some colors in nature that can never be seen on a screen—often those that are the most vibrant.
However, we can still get a sense of how the primary RGB colors of light add up. Red and blue readily mix to form purple, which is easy to describe as a reddish-blue hue. Blue and green mix to form a blue-green color, such as cyan. The odd one is red and green together, which produces yellow, a color most people would not describe with the term red-green, being a pair of colors which seem as opposite as can be imagined. (It is difficult to imagine that the two colors of a poinsettia could ever combine to form the color of a banana.) All three together form white, and all three absent form black. Staring at any screen up close will show exactly how video displays blend colors: each pixel is made of subpixels, or regions of only one color, that can vary only in intensity. When viewed from a reasonable distance, the eye does not do a good job of seeing the separate subpixels, and so it blends the three different hues together to form the final, intended color.
In computer displays, the widely used sampling method is to give each primary color an intensity of eight bits, producing a 24-bit overall color sample per pixel. This lets us step back and look at the size of video. Compared to raw voice, which must encode only one 16-bit sample at a given time, raw video must encode hundreds of thousands of 24-bit color samples. Even though the number of times a second the set of pixels—the picture, known as a frame—changes is far less, the size of a frame quickly dominates. In standard video, the picture (and thus the pixel color intensities) changes up to 30 times a second, far less often than the 8,000 times a second that the voice intensity changes. Multiplying it out, a raw voice stream requires 128,000 bits a second; a raw 640×480 video at 30 frames per second requires 221,184,000 bits per second for just the video portion, not including any associated audio streams.
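To make the arithmetic concrete, here is a quick sketch that reproduces these numbers; the function names and defaults are ours, chosen only for illustration.

```python
# Rough bit-rate arithmetic for raw (uncompressed) media, using the
# chapter's example values.

def raw_voice_bps(sample_rate_hz=8000, bits_per_sample=16):
    """Raw voice: one sample per sampling interval."""
    return sample_rate_hz * bits_per_sample            # 8,000 * 16 = 128,000 bps

def raw_video_bps(width=640, height=480, bits_per_pixel=24, fps=30):
    """Raw video: every pixel of every frame, video only (no audio)."""
    return width * height * bits_per_pixel * fps       # 640*480*24*30 = 221,184,000 bps

print(raw_voice_bps())   # 128000
print(raw_video_bps())   # 221184000
```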
9.1.2 Video Compression
The large size of video clearly begs for compression. Fortunately, much of the detail in video is wasted on the viewer, and video has a tremendous amount of room for lossy compression to be employed. Video compression is an area of active research, but the basics are easy to understand and are used in modern compression algorithms, such as those in the JPEG and MPEG families.
9.1.2.1 Still Image Compression
The first place we can look to compress video is the representation of color. Many readers of this book may remember how color digital displays evolved, and thus have a natural understanding of how excessive 16,777,216 possible colors per pixel once seemed. However, it bears repeating. The eye can be challenged, by certain color transitions, or changes of color from one to the next, to need all 24 bits to capture most of the range of differences it can see. But, ignoring those specific, challenging situations, the eye can really see only a few thousand colors. Furthermore, if exact color reproduction is not the point—and it is not, for video—making minor approximations here or there is quite acceptable. Therefore, the first technique for video compression is to radically reduce the number of colors, following the usual media compression technique of focusing the bits where they are most perceivable by the human observer, and then filling in the details with bits that may not be kept or can be afforded to be lost.
Although the red, green, and blue representation works quite well for what the video display needs to do, it is not the most obvious choice for representing human-perceived color. We can take a hint from the development of analog television. Intensity—the overall intensity of the pixel—matters the most. Thus, black and white works quite well for representing the subject of the video, so long as people are not dressed in garish colors that would be missed. Because intensity is so important, representing the intensity is the best use of the bits of information for a pixel. Once the black-and-white intensity is known, the remaining bits can be used to give the shade of the pixel a tint, or hue. Three primary colors mean three additional hue values, right? Thankfully not. If we think of color as a mathematical vector of three dimensions, knowing the intensity is like knowing the length. We only need to know how far the color extends into two of the three dimensions, along with the length, to get the intended color back out. (This is embedded in the very definition of having three dimensions to color.) The two choices that television designers settled on were to record how red and how blue the color is, leaving green to be inferred. The representation starts with an intensity, also known as the luminance, which is highest for white and lowest for black. The tinting values, the two chrominances, each add or subtract an amount of the primary color they represent. Together, the luminance and two chrominances create another 24-bit value, often represented by the abbreviation YCbCr.
(Y is the standard symbol for luminance, and C is for chrominance.) White can be represented, in percentages, as 100% Y, 0% Cb, and 0% Cr, or (100, 0, 0). Black is (0, 0, 0), and middle gray is (50, 0, 0). From white, we can get cyan, which has full blue and green but no red. This requires subtracting off the full measure of red. Therefore, a nearly pure cyan is (100, 0, −100). Similarly, a nearly pure green is (100, −100, −100), removing both the red and blue components. The qualification of nearly is used only because the standard weightings for the YCbCr space require a bit of tweaking to get to the pure RGB versions of the color. (YPbPr, the name for component video cables such as those used with HD televisions, refers to the same concept, but for analog signals.)
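As a rough sketch of how such a triplet can be computed from RGB (using the common BT.601 weightings; the function name and the scaling to 8-bit values are our own illustrative choices, not any particular codec's exact math):

```python
def rgb_to_ycbcr(r, g, b):
    """Convert 8-bit R, G, B values to 8-bit Y, Cb, Cr (BT.601 weightings).

    Y is the luminance; Cb and Cr are the blue and red chrominances,
    offset by 128 so that "no tint" sits in the middle of the 8-bit range.
    """
    y  = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128 + 0.564 * (b - y)      # how blue the pixel is, relative to its brightness
    cr = 128 + 0.713 * (r - y)      # how red the pixel is, relative to its brightness
    return round(y), round(cb), round(cr)

print(rgb_to_ycbcr(255, 255, 255))  # white  -> full Y, neutral chrominance
print(rgb_to_ycbcr(0, 255, 255))    # cyan   -> red chrominance pulled well below 128
print(rgb_to_ycbcr(0, 255, 0))      # green  -> both chrominances pulled below 128
```

Note how cyan and green drive the chrominances well below their neutral value of 128, mirroring the negative percentages above.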
Given that the eye is most particular about the precision of the intensity, and less so about the precision of the tinting, video compression will commonly change the ratio of information devoted to each. Specifically, the video compression can require that some of the information be held the same across multiple pixels. By halving the amount of information for the chrominance, the 4:2:2 encoding stores twice as much information for luminance as for the blue and red chrominances, by requiring that the chrominances change only every other pixel horizontally. Simply halving the amount of information in one pixel dedicated to chrominance results in saving one-third of the bits. The sacrifice is in color fidelity when the color changes, but for video—and often for still images as well—this may not matter, as the eye is considered to be roughly half as sensitive to the tint as it is to brightness. There is also a 4:2:0 "ratio," which is actually more common. 4:2:0 is a special term that means making squares of two pixels by two pixels share the same chrominance.
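A minimal sketch of what 4:2:0 subsampling amounts to—each 2×2 block of pixels sharing one chrominance pair—might look like the following (a simple block average, not any particular codec's filter):

```python
# Hypothetical illustration of 4:2:0 chroma subsampling: keep full-resolution
# luminance, but let each 2x2 block of pixels share one Cb and one Cr value.

def subsample_420(cb_plane, cr_plane):
    """cb_plane and cr_plane are lists of rows (lists) of 8-bit values.

    Returns half-resolution planes where each value is the average of a
    2x2 block in the original. Assumes even width and height.
    """
    def shrink(plane):
        out = []
        for y in range(0, len(plane), 2):
            row = []
            for x in range(0, len(plane[y]), 2):
                block = (plane[y][x] + plane[y][x + 1] +
                         plane[y + 1][x] + plane[y + 1][x + 1])
                row.append(block // 4)
            out.append(row)
        return out
    return shrink(cb_plane), shrink(cr_plane)
```

For 24-bit pixels, this keeps 8 bits of luminance per pixel but only 2 bits each of Cb and Cr on average, or 12 bits per pixel in total—half the raw size.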
This sort of color compression falls into the category of quantization compression, where the goal is to represent either a continuous or a more precise value with a less precise, quantized value. Quantization compression was also used in voice, for the two logarithmic encoders in G.711. That compression cut the number of bits in half, but it was smarter than just dividing the signal by 256: the quantization steps between two encoded values do not have to be even, and logarithmic encoding concentrated the slices toward the smaller signal values. 4:2:2 compression for video concentrates the bits toward luminance.
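For comparison, here is a simplified sketch of the logarithmic idea behind G.711's μ-law encoder (the parameter μ = 255 is the real one; rounding to a plain signed integer is our simplification, not G.711's exact segmented bit layout):

```python
import math

MU = 255  # mu-law parameter used by G.711 in North America

def mu_law_quantize(sample, bits=8):
    """Map a sample in [-1.0, 1.0] to a small integer code.

    The logarithm packs most of the quantization steps near zero, where
    small signals live, rather than spacing the steps evenly.
    """
    compressed = math.copysign(math.log1p(MU * abs(sample)) / math.log1p(MU), sample)
    levels = 2 ** (bits - 1) - 1
    return int(round(compressed * levels))

print(mu_law_quantize(0.01), mu_law_quantize(0.02))  # small signals: clearly distinct codes
print(mu_law_quantize(0.90), mu_law_quantize(0.95))  # large signals: nearly the same code
```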
Even with reduced chrominance, video compression will also quantize the luminance, based on the range and precision it actually achieves over the area being quantized. This is a fundamental part of encoding both still images and moving video. Because the range of intensities within an image varies from part to part, this sort of quantization is best done in small chunks, or regions, of the image. Once quantized, the encoding must also carry the particular parameters used to perform the quantization.
As a general rule, the hard part of media compression is finding a representation that shuffles the bits into categories that matter differently—from there, compression can be achieved simply by chopping bits from the categories (high-intensity audio samples, chrominance) that matter the least. Nowhere can this be seen more clearly than with the basis for modern image and video compression, as found in the JPEG and MPEG formats. Cutting colors and rescaling to pack in bits where they matter most is one thing. But the designers of JPEG thought of the image differently, looking at its frequency components. Just as audio has the obvious frequency components, representing the set of pitches that are being heard at a given time, video has two dimensions (horizontal and vertical) of frequencies at any given point. The thinking behind JPEG is that the higher frequencies represent the presence of detail, and the lower frequencies represent the gradual variations. In audio compression, converting the signal to frequencies is useful because some frequencies are not as important, and producing what amounts to a rank ordering of the most important frequencies at a given time allows the important parts of the signal to be preserved, while the less important parts—such as the faint but often highly detailed noises in the background of a recording—can be erased or approximated more easily. The same applies to video.
(If you are thinking, by now, that color, being composed of multiple pure tones or frequencies of light, could benefit from being represented this way, then you are on the right path. Color itself, however, happens not to be a good candidate for being represented and then compressed this way, because the eye already does such a good job of removing most of the information from light, forcing it from an infinite-dimensional space of continuous functions down to a three-dimensional space of primary colors, with enough tolerance for approximation.)
One method used to convert an image (or any signal) from space-defined pixels to frequencies is to take the Fourier transform of the image. It's easier to think of Fourier transforms first with audio. We know that a sound can be made of one or more pitches. A chord can contain, for example, the four notes middle D flat, E flat, F, and A flat (producing a Dbadd2 chord), and those four tones will be the four most important frequencies while the chord is being played. Of course, the instrument or instruments playing the chord each produce a number of both similar and widely different tones around the main tone of the note, and losing those would lose the character of the instrument completely. But, by seeing that some pitches are more important than others, we can begin to see how the pitches can be ranked. The Fourier transform does not do any ranking itself. It is purely mathematical—a change in representation (really, of linear basis) from time to frequency and back. The notion is rather simple. Overlap the signal with the signal for each pure tone or pitch. The higher the overlap, the more that pitch is present in the signal. We can think of the Fourier transform as testing the signal for overlap with each and every pure tone. Overlap for two signals, or functions, can simply be thought of as the sum, over the entire signal, of the product of the signal and the "test" tone. If the tone overlaps well with the signal, then each value in the pure test tone will multiply with that of the pure tone embedded in the real signal, and the sum will add up to a large number. However, if the pure tone is not present in the original signal, then the sums will go out of step, and the result will be small. Figure 9.1 shows this in action.
If two signals are out of phase but have the same frequency, then the sum of the product can go to 0, even though there is a match. By using complex numbers (see Chapter 5), however, the phase can be captured without ambiguity or mistake. It is this process that produces the Fourier transform. For math's sake, we can write it as
F(ω) = ∫ f(t) e^(−iωt) dt
where F is the Fourier-transformed representation of f, based on the angular frequency ω, which is 2π times the frequency.
Figure 9.1: Fourier Transform.
a) Original signal 2 cos(30 · 2πx) + cos(10 · 2πx), composed of two signals: one at a frequency of 30 Hz, and another, half as strong, at a frequency of 10 Hz.
b) Fourier transform, or frequency plot, of the same signal. Notice the strong peaks at both 10 Hz and 30 Hz, with the 30 Hz peak having twice the intensity. This is the advantage of the Fourier transform, which pulls out the frequencies present in a signal.
c) Overlap of a 10 Hz test signal with the original signal, which does have a 10 Hz component. Notice how the running sum—representing the amount of overlap the test signal has with the original, accumulated as the product of the two signals at each point—steadily increases. The 10 Hz test signal is a match, and there will be a peak in the Fourier transform at 10 Hz.
d) Overlap of a 20 Hz test signal with the original signal, which does not have a 20 Hz component. The running sum does not steadily increase from left to right, but instead vacillates around zero, as expected because the test signal does not overlap with the original. The 20 Hz test signal is not a match, and there will be no peak in the Fourier transform at 20 Hz.

What does all this mean? It means that we have a mathematical way of converting a signal to its frequencies. A continuous signal has a continuous (infinite) number of frequencies. But, in the digital world, signals are always finite and discrete. We can shift from the continuous Fourier transform to its discrete variant, however. This variant uses the same math, but replaces the integral with a discrete sum. The result of a discrete Fourier transform of a signal with a certain number of samples is a new signal with the same number of samples, but with the first sample representing the lowest frequency—the larger that sample, the more of that frequency is present in the original signal—and so on.
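To make the "overlap with a test tone" idea concrete, here is a minimal sketch of a direct discrete Fourier transform (the O(n²) definition, written for illustration only; real encoders use fast transforms):

```python
import cmath
import math

def naive_dft(samples):
    """Discrete Fourier transform by direct overlap with complex test tones.

    For each frequency bin k, sum the product of the signal and a pure tone
    at that frequency; a large magnitude means the tone is present.
    """
    n = len(samples)
    return [sum(samples[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n))
            for k in range(n)]

# A 64-sample signal with a strong 3-cycle component and a weaker 7-cycle one.
signal = [2 * math.cos(2 * math.pi * 3 * t / 64) + math.cos(2 * math.pi * 7 * t / 64)
          for t in range(64)]
spectrum = naive_dft(signal)
print(round(abs(spectrum[3])), round(abs(spectrum[7])), round(abs(spectrum[5])))  # 64 32 0
```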
The method now begins to become clear. Take the signal, then take the discrete Fourier transform of it. The most important frequencies will have the largest values, and less important frequencies will have smaller values. Assigning more bits to the larger frequency components and fewer bits to the smaller ones through quantization will compress the signal. Most of the frequency components will actually get compressed to zero, or completely removed, when compression is successful.
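Sketched in code, the quantization step is little more than rounding each (real-valued) coefficient to a coarse step so that the small ones vanish; the step size here is an arbitrary illustrative value:

```python
def quantize_coefficients(coefficients, step=16):
    """Round each frequency coefficient to the nearest multiple of `step`.

    Small coefficients collapse to zero and can be dropped entirely;
    large ones survive with reduced precision.
    """
    return [round(c / step) * step for c in coefficients]

print(quantize_coefficients([1030.0, 95.0, 7.5, -3.2, 0.8]))  # [1024, 96, 0, 0, 0]
```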
Coming back to video, the thought is the same. The discrete Fourier transform—and its variant, the discrete cosine transform (DCT), which works with entirely real numbers (no imaginary numbers)—can work in both the horizontal and vertical directions, to capture the frequencies present in the image. Now, it may not seem that a video image obviously has frequencies. Over the entire image, it probably doesn't have any that can be seen intuitively. But the trick is to divide the image up into small rectangles—the size depends on a number of factors, such as how much or little the part of the image in the rectangle varies—and do the frequency transform and subsequent quantization for each rectangle.
Now, the benefit of frequency information can begin to make sense. A rectangle whose image barely changes, such as background shading, has very little frequency information, and so can be compressed greatly. But even areas that represent real shapes can be compressed rather well, with a lot of loss but preserving the rough character of that part of the image. As long as each rectangle does not have a lot of irregularity in it, the compression will be good for each rectangle. From here, the compressors try to figure out how to make each rectangle as large as possible for the same bits, looking for areas with lots of similarity.
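As a rough sketch of the transform step itself, here is a textbook 2D DCT-II over an 8×8 block, written for clarity rather than speed (JPEG and MPEG encoders use fast, fixed-point versions of the same idea):

```python
import math

def dct_2d_8x8(block):
    """2D DCT-II of an 8x8 block of pixel values.

    Each output coefficient is the overlap of the block with a
    horizontal-and-vertical cosine pattern of a given frequency.
    """
    def alpha(k):
        return math.sqrt(1 / 8) if k == 0 else math.sqrt(2 / 8)

    out = [[0.0] * 8 for _ in range(8)]
    for u in range(8):          # vertical frequency
        for v in range(8):      # horizontal frequency
            total = 0.0
            for x in range(8):
                for y in range(8):
                    total += (block[x][y]
                              * math.cos((2 * x + 1) * u * math.pi / 16)
                              * math.cos((2 * y + 1) * v * math.pi / 16))
            out[u][v] = alpha(u) * alpha(v) * total
    return out

# A perfectly flat block (say, clear sky at value 128): only the zero-frequency
# (DC) coefficient survives, so every other coefficient quantizes away to nothing.
flat = [[128] * 8 for _ in range(8)]
coeffs = dct_2d_8x8(flat)
print(round(coeffs[0][0]), round(coeffs[1][0]), round(coeffs[7][7]))  # 1024 0 0
```

Quantizing a block like this down to nothing but its DC term is exactly how the smooth sky squares in Figure 9.2 end up as solid colors.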
This adaptive sizing of rectangles, and the rectangles themselves, can be seen fairly easily in compressed images, but usually happens away from the action (intentionally, as we will see). Let's take the example in Figure 9.2.
Figure 9.2: Compression Artifacts

The lefthand image is the original, and the righthand image has the compression set high—as would happen away from the action. Right away, you can see the outlines of the rectangles, in this case all of the same size, as this is a JPEG image. Most of the squares, such as those for the sky and the solid parts of the building—wherever there is not a lot of detail—got compressed down to no frequency components in any direction: a solid color. But some areas had more than one frequency component that wasn't compressed to zero. If you look at the right side of the tower, where it meets the sky, you will notice that all of the squares seem to have vertical bands, and thus are vertically smooth. This happens because one frequency component in the horizontal direction got retained, which makes sense for a mostly vertical shape. Squares with more than one component in each direction can be seen where people's heads are.
9.1.2.2 Motion Compression
Once a still frame has been compressed, we can move on to compressing the motion itself. The simple way of doing this would be to just have a sequence of compressed images, one compressed image for each frame. This is actually done in a format called Motion JPEG. However, doing so would not take advantage of the fact that most frames are nearly identical, with similar backgrounds but different images of the active subjects as they move around.
We can think of compressing the motion itself by starting out with the first frame, a compressed still image like any other. But, for the following frame, imagine not storing another compressed image. Instead, just store which pixels in the image, corresponding to a moving subject, have moved, and in which direction they have moved. The decoder will just copy those pixels forward, moving them according to the directions the encoder gives. The only new pixels the encoder needs to send are those for the background that got revealed as the moving subject moved away from them. As most objects are fairly solid or large, the pixels in them move together, and so encoding regions of essentially uniform motion is fairly simple and highly efficient. From this comes the dual concept of the key frame (or I frame, for intra-coded), the first frame in the sequence with the complete set of pixels, and the intermediate frames (or P frames, for predictive), the following frames that carry only the motion and the newly revealed pixels.
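A minimal sketch of what a decoder does with an intermediate frame might look like this (the data structures—per-block motion vectors and a list of newly revealed pixels—are our illustrative simplification of what real codecs carry):

```python
# Hypothetical sketch of applying an intermediate (P) frame: copy each block
# from the previous frame, shifted by its motion vector, then paint in any
# newly revealed pixels the encoder sent explicitly.

def apply_motion(prev_frame, motion_vectors, new_pixels, block=8):
    """prev_frame: 2D list of pixel values.
    motion_vectors: dict mapping (block_row, block_col) -> (dy, dx),
        meaning that block's content moved by (dy, dx) since the last frame.
    new_pixels: dict mapping (y, x) -> value for pixels sent explicitly.
    """
    h, w = len(prev_frame), len(prev_frame[0])
    frame = [[0] * w for _ in range(h)]
    for (br, bc), (dy, dx) in motion_vectors.items():
        for y in range(br * block, (br + 1) * block):
            for x in range(bc * block, (bc + 1) * block):
                sy, sx = y - dy, x - dx          # where this pixel came from
                if 0 <= sy < h and 0 <= sx < w:
                    frame[y][x] = prev_frame[sy][sx]
    for (y, x), value in new_pixels.items():     # newly revealed background
        frame[y][x] = value
    return frame

prev = [[0] * 16 for _ in range(16)]
prev[0][0] = 255                                     # a bright pixel in the top-left block
current = apply_motion(prev, {(0, 0): (2, 3)}, {})   # that block slid down 2, right 3
print(current[2][3])                                 # 255
```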
This would work just fine, except that viewers like to jump around in videos, or intermediate frames might get lost somewhere. If there were only one key frame, the video would be ruined by the absence of even one bit of information from the intermediate frames. To overcome that, the encoder starts over with a new key frame every so often. This effect, too, is something you can see rather easily. DVDs and digital video recorders use similar compression algorithms, with key frames spaced fairly far apart and a few dozen intermediate frames in between. When you fast forward or rewind the video, you may see these key frames go by, one by one, as still pictures. This is quite different from the old analog VCR days, where the video would just animate faster, and happens because processing the intermediate frames takes too much time for fast forwarding.
Figure 9.3 illustrates the parts of the intermediate frame that are used for added compression.
Figure 9.3: Motion Compression.
a) The ball moves from Frame 1 to Frame 2.
b) The intermediate frame encoding for Frame 2 stores that the pixels that make up the ball in Frame 1 need to move down and to the right to make Frame 2.
c) The intermediate frame encoding for Frame 2 also stores the newly revealed pixels for Frame 2, shown here.
Video compression leaves much of the intelligence to the compressor: the decompressor is required only to execute the instructions. More intelligent compressors can focus on the subject matter of importance, and the decompressor itself does not need to understand what matters more in the subject in order to do its job.
There are a few types of video codecs, all of them similar to each other, but with some differences in the degree of intelligence and the type of formatting. ITU H.262, the codec used in MPEG-2 video, is the most common codec, used in DVDs as well as a wide variety of downloadable Internet video content. The bit rate can go as high as 10 Mbps for standard-definition DVD video.
ITU H.264, known as Advanced Video Coding (AVC) and the foundation for MPEG-4 video, was designed to produce a far better picture at a significantly smaller bit rate—the goal is about half the bit rate of MPEG-2 for the same quality. AVC includes a number of improvements, including the ability for decoders to smooth the edges between the blocks, such as those seen in Figure 9.2. AVC is the foundation of most high-definition (HD) video, including Blu-ray video discs and many satellite and cable television transmissions. AVC is also used in YouTube and other Adobe Flash–based video downloads. Other, often proprietary, codecs exist for videoconferencing and webinar (web seminar) broadcasts, which can take advantage of the constrained subject matter—a series of heads or presentation slides, for example—to compress even better than general-purpose video compressors.
9.1.3 Video Signaling and Bearer Technologies
Video must be carried in much the same way as voice. The video flow or call may need to be set up—this is especially true for conferencing—and then the video stream itself must be transported, along with the related audio streams.
9.1.3.1 Video Bearer
Let's start with the video transport, as the bearer, first. Because many video applications are streaming downloads rather than conferencing, both real-time and stream-based transports need to be considered.
Real-time video transport is often based on the same RTP mechanism that is used for voice. When transported this way, each of the frames in the video may span multiple RTP packets. The opposite will also happen, and multiple frames may meet in any given RTP packet. However, the RTP mechanism applies the same timestamp and sequence number functions to the video stream, allowing the video decoder to piece the stream back together when packets are lost or reordered. The video sender can send separate RTP streams, sharing the same timestamp clock, for each of the media streams that make up the video. This can be