Graphics does not require this large a range, as only magnitudes up to 2^32 need to be representable, and for colors even 2^10 is enough. The precision of these fixed-point numbers is fixed (1/65536), whereas the precision of floats depends on the magnitude of the values. Values close to zero have a very high accuracy: two consecutive floats at around 1.0 have a precision of 1/16777216, and floats at around 250.0 have roughly the same precision as 16.16 fixed-point numbers, while larger numbers become more inaccurate (two consecutive floats around 17 million are more than 1.0 units apart). OpenGL requires an accuracy of only one part in 10^5, which is a little under 17 bits; single-precision floats have 24 bits of accuracy.
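This spacing is easy to verify in C; the following minimal check (using the standard nextafterf from math.h) prints the gap between consecutive floats near 1.0 and near 2^24:

#include <math.h>
#include <stdio.h>

int main( void )
{
    /* gap between consecutive floats near 1.0: 1/16777216 */
    printf( "%g\n", nextafterf( 1.0f, 2.0f ) - 1.0f );
    /* gap near 2^24 = 16777216: already 2.0 units */
    printf( "%g\n", nextafterf( 16777216.0f, 1e30f ) - 16777216.0f );
    return 0;
}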
Below are C macros for converting from float to fixed and vice versa:

#define float_to_fixed( a ) (int)((a) * (1<<16))
#define fixed_to_float( a ) (((float)(a)) / (1<<16))
These are “quick-and-dirty” versions of the conversions. float_to_fixed can overflow if the magnitude of the float value is too great, or underflow if it is too small. float_to_fixed can also be made slightly more accurate by rounding. For example, asymmetric arithmetic rounding works by adding 0.5 to the number before truncating it to an integer, e.g., (int)floor((a) * 65536.0f + 0.5f).
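Wrapped into a macro, the rounding conversion might look like this (the macro name is ours; floor comes from math.h):

#include <math.h>
#define float_to_fixed_round( a ) (int)floor( (a) * 65536.0f + 0.5f )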
Finally, note that some of these conversions are expensive on some processors and thus should not be used in performance-critical code such as inner loops.
Here are some 16.16 fixed-point numbers, expressed in hexadecimal, and the corresponding decimal numbers:
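0x00000001     0.0000152587890625 (1/65536)
0x00008000     0.5
0x00010000     1.0
0x00024000     2.25
0x00100000    16.0
0xFFFF0000    -1.0 (two's complement)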
Depending on the situation it may make sense to move the decimal point to some other location, although 16.16 is a good general choice. For example, if you are only interested in numbers between zero and one (but excluding one), you should move the decimal point all the way to the left; if you use 32 bits, denote that format u0.32 (here u stands for unsigned). In rasterization, the number of sub-pixel bits and the size of the screen in pixels determine the number of bits you should have on the right side of the decimal point. Signed 16.16 is a compromise that is relatively easy to use, and gives the same relative importance to numbers between zero and one as to values above one.
In the upcoming examples we also use other fixed-point formats. For example, a 32.32 fixed-point value would be stored using 64 bits and could be converted to a float by dividing it by 2^32, whereas 32.16 would take 48 bits and have 32 integer and 16 decimal bits, and 32.0 would denote a regular 32-bit signed integer. To distinguish between unsigned (such as u0.32) and signed two's complement fixed-point formats we prepend unsigned formats with u.
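As a sketch of what conversions for these formats could look like (macro names are illustrative; the 32.32 argument is assumed to be a 64-bit integer):

/* 32.32: stored in a 64-bit integer, 32 fractional bits */
#define fixed3232_to_double( a )  ((double)(a) / 4294967296.0)   /* divide by 2^32 */
/* u0.32: a 32-bit unsigned value, all bits fractional */
#define ufixed032_to_double( a )  ((double)(unsigned int)(a) / 4294967296.0)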
In this appendix, we first go through fixed-point processing in C. We then show what you can do using assembly language, and conclude with a section on fixed-point programming in Java.
A.1 FIXED-POINT METHODS IN C
In this section we first discuss the basic fixed-point operations, followed by the shared exponent approach for vector operations, and conclude with an example that precalculates trigonometric functions into a table.
A.1.1 BASIC OPERATIONS
The addition of two fixed-point numbers is usually very straightforward (and subtraction
is just a signed add):
#define add_fixed_fixed( a, b ) ((a)+(b))
We have to watch out, though; the operation may overflow. As opposed to floats, the overflow is totally silent: there is no warning about the result being wrong. Therefore, you should always insert debugging code into your fixed-point math, the main idea being that the results before and after clamping from 64-bit integers to 32-bit integers have to agree.1 Here is an example of how that can be done:
#if defined(DEBUG)
int add_fixed_fixed_chk( int a, int b )
{
int64 bigresult = ((int64)a) + ((int64)b);
int smallresult = a + b;
assert(smallresult == bigresult);
return smallresult;
}
#endif
In release builds the addition macro stays a plain add, while debug builds can route it through the checked function, for example like this (a sketch of the arrangement):

#if defined(DEBUG)
#define add_fixed_fixed( a, b ) add_fixed_fixed_chk( (a), (b) )
#else
#define add_fixed_fixed( a, b ) ((a)+(b))
#endif
1 Code examples are not directly portable. Minimally you have to select the correct 64-bit type for your platform. Examples: long long with gcc, __int64 with older Microsoft compilers.
Another point to note is that these fixed-point routines should always be macros or inlined functions, not regular function calls. The function call overhead would take away most of the speed benefits of fixed-point programming. For the debug versions, using regular functions is fine, though.
Multiplications are more complicated than additions. Let us analyze the case of multiplying two 16.16 numbers and storing the result into another 16.16 number. When we multiply two 16.16 numbers, the accurate result is a 32.32 number. We ignore the last 16 bits of the result simply by shifting right 16 steps, yielding a 32.16 number. If all the remaining bits are zero, either one or both of the operands were zero, or we underflowed, i.e., the magnitude of the result was too small to be represented in a 16.16 fixed-point number. Similarly, if the result is too large to fit in 16.16, we overflow. But if the result is representable as a 16.16 number, we can simply take the lowest 32 bits. Note that the intermediate result must be stored in a 64-bit integer, unless the magnitude of the result is known to be under 1.0 before multiplication. We are finally ready to define multiplication:
#define mul_fixed_fixed( a, b ) (int)(((int64)(a)*(int64)(b)) >> 16)
If one of the multiplicands is an int, then the inputs are 16.16 and 32.0, the result is 48.16,
and we can omit the shift operation:
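/* one possible version (macro name illustrative); the low 32 bits of the
   48.16 result are again a valid 16.16 number if no overflow occurred */
#define mul_fixed_int( a, b ) ((a)*(b))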
Multiplications overflow even more easily than additions. The following example shows how you can check for overflows in debug builds:
#if defined(DEBUG)
int mul_fixed_fixed_chk( int a, int b )
{
int64 bigresult = (((int64)a) * ((int64)b)) >> 16;
int64 sign = bigresult >> 31;   /* replicate bit 31 across the high bits */
/* high bits must be just sign bits (0's or 1's) */
assert( (sign == 0) || (sign == -1) );
return (int)bigresult;
}
#endif
Note also that multiplications by powers of two are typically faster when done with shifts instead of normal multiplication. For example:
assert((a << 4) == (a * 16));
Let us then see how division works. Dividing two 16.16 numbers gives you an integer, and loses precision in the process. However, as we want the result to be 16.16, we should shift the numerator left 16 steps and store it in an int64 before the division. This also avoids losing the fractional bits. Here are several versions of the division with different arguments (fixed or int), producing a 16.16 result:
#define div_fixed_fixed( a, b ) (int)( (((int64)(a))<<16) / (b) )
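/* further variants along the same lines (macro names illustrative) */
#define div_fixed_int( a, b )  ((a) / (b))                          /* 16.16 / 32.0  */
#define div_int_fixed( a, b )  (int)( (((int64)(a)) << 32) / (b) )  /* 32.0  / 16.16 */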
These simple versions do not check for overflows, nor do they trap the case b = 0. Division, however, is usually a much slower operation than multiplication. If the range of inputs is small enough, it may be possible to precalculate a table of reciprocals and perform a multiplication instead. With a wider range one can store a sparse table of reciprocals and interpolate between the nearest results.
For slightly more precision, we can incorporate rounding into the fixed-point operations. Rounding works much the same way as when converting a float to a fixed-point number: add 0.5 before truncating to an integer. Since we use integer division in the operations, we just have to add 0.5 before the division. For multiplication this is easy and fairly cheap: since our divider is the fixed value 1 << 16, we add one half of that, 1 << 15, before the shift:
#define mul_fixed_fixed_round( a, b ) \
    (int)( ((int64)(a) * (int64)(b) + (1<<15)) >> 16 )
Similarly, for correct rounding in division of a by b, we should add b/2 to a before dividing by b.
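As a sketch, assuming a positive b (the macro name is ours):

#define div_fixed_fixed_round( a, b ) \
    (int)( ( (((int64)(a)) << 16) + ((b) >> 1) ) / (b) )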
A.1.2 SHARED EXPONENTS
Sometimes the range that is required for calculations is too great to fit into 32-bit registers. In some of those cases you can still avoid the use of full floating point. For example, you can create your own floating-point operations that do not deal with the trickiest parts of the IEEE standard, e.g., the handling of infinities, NaNs (Not-a-Numbers), or floating-point exceptions.
However, with vector operations, which are often needed in 3D graphics, another possibility is to store the meaningful bits, the mantissas, separately into integers, perform integer calculations using them, and share the exponent across all terms. For example, if you need to calculate a dot product of a floating-point vector with a vector of integer or fixed-point numbers, you can normalize the floating-point vector to a common base exponent, perform the multiplications and additions in fixed point, and finally, if needed, adjust the base exponent depending on the result. Another name for this practice of shared exponents is block floating point.
Using a shared exponent may lead to underflow, truncating some of the terms to zero. In some cases such truncation may lead to a large error. Here is a somewhat contrived example of a worst-case error: [1.0e40, 1.0e8, 1.0e8, 1.0e8] · [0, 32768, 32768, 32768]. With a shared exponent the first vector becomes [1, 0, 0, 0] * 1e40, which, when dotted with the second vector, produces zero, a result that is very different from the true answer of roughly 9.8e12.
The resulting number sequence, mantissas together with the shared exponent, is really a vectorized floating-point number and needs to be treated as such in the subsequent calculations, until the point where the exponent can finally be eliminated. It may seem that since the exponent must be normalized in the end in any case, we are not saving much. Keep in mind, though, that the most expensive operations are only performed once for the full dot product. It may even be possible to perform the required multiplication and addition operations with efficient multiply-and-accumulate (MAC) operations in assembly if the processor supports them.
Conversion from floating-point vectors into vectorized floating point is only useful in situations where the cost of conversion can be amortized somehow. For example, if you run 50 dot products where the floating-point vector stays the same and the fixed-point vectors vary, this method can save a lot of computation. An example of where you might need this kind of functionality is a physics library. A software implementation of vertex array transformation by modelview and projection matrices is another example where this approach could be attempted: multiplication of a homogeneous vertex with a 4 × 4 matrix can be done with four dot products.
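The following sketch illustrates the idea in C (the function names, the struct, and the 2.30 mantissa format are our assumptions, not a definitive implementation):

#include <math.h>
#include <stdint.h>

typedef struct
{
    int32_t m[4];   /* mantissas, roughly 2.30 relative to 2^exp */
    int     exp;    /* shared base exponent */
} vec4_bfp;

/* Normalize a float vector to a shared exponent. */
static void vec4_to_bfp( const float v[4], vec4_bfp *out )
{
    int i, e, maxexp = -10000;
    for( i = 0; i < 4; i++ )
    {
        frexp( v[i], &e );                /* v[i] = f * 2^e, 0.5 <= |f| < 1 */
        if( v[i] != 0.0f && e > maxexp )
            maxexp = e;
    }
    out->exp = maxexp;
    for( i = 0; i < 4; i++ )              /* much smaller terms underflow to 0 */
        out->m[i] = (int32_t)ldexp( v[i], 30 - maxexp );
}

/* Dot product against a vector of 16.16 fixed-point numbers.   */
/* The true value of the result is acc * 2^(exp - 30) / 65536.  */
static int64_t bfp_dot( const vec4_bfp *a, const int32_t b[4] )
{
    int64_t acc = 0;
    int     i;
    for( i = 0; i < 4; i++ )
        acc += (int64_t)a->m[i] * b[i];
    return acc;
}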
Many processors support operations that can be used for normalizing the result. For example, ARM processors with the ARMv5 instruction set or later support the CLZ instruction that counts the number of leading zero bits in an integer. Even when the processor supports these operations, they are typically exposed only as compiler-specific intrinsic functions or through inline assembler. For example, a portable version of count-leading-zeros can be implemented as follows:
/* Table stores the CLZ value for a byte */
static unsigned char clz_table[256] = { 8, 7, 6, 6, /* ... CLZ of each byte value ... */ };
INLINE int clz_unsigned( unsigned int num )
{
    int res = 24;
    if (num >> 16)
    {
        num >>= 16;
        res -= 16;
    }
    if (num > 255)
    {
        num >>= 8;
        res -= 8;
    }
    return clz_table[num] + res;
}

The GCC compiler has a built-in function for CLZ that can be used like this:

INLINE int clz_unsigned( unsigned int num )
{
    /* __builtin_clz(0) is undefined, so zero is handled separately */
    return num ? __builtin_clz( num ) : 32;
}

The built-in gets compiled to the ARM CLZ opcode when compiling for an ARM target. The performance of this routine depends on the processor architecture, and for some processors it may be faster to calculate the result with arithmetic instructions instead of table lookups.
In comparison, the ARM inline assembly variant of the same thing is:

INLINE int clz_unsigned( unsigned int num )
{
    int result;
    asm { CLZ result, num }   /* ARMv5 and later */
    return result;
}
A.1.3 TRIGONOMETRIC OPERATIONS
The use of trigonometric functions such as sin, cos, or arctan can be expensive both in the floating-point and the fixed-point domains. But since these functions are repeating, symmetric, have a compact range [−1, 1], and can sometimes be expressed in terms of each other (e.g., sin(θ + 90°) = cos(θ)), you can precalculate them directly into tables and store the results in fixed point.

A case in point is sin (and from that cos), for which only a 90° segment needs to be tabulated; the rest can be obtained through the symmetry and continuity properties of sin. Since the table is indexed by an integer, the input parameter needs to be discretized as well. Quantizing 90° to 1024 steps usually gives a good trade-off between accuracy, table size, and ease of manipulation of angle values (since 1024 is a power of two). The following code precalculates such a table:
short sintable[1024];
int ang;
for( ang = 0; ang < 1024; ang++ )
{
    /* angle_in_radians = ang/1024 * pi/2; assumes <math.h> and a definition of PI */
    double rad_angle = (ang * PI) / (1024.0 * 2.0);
    sintable[ang] = (short)( -sin(rad_angle) * 32768.0 );
}
In the loop we first convert the table index into radians. Using that value we evaluate sin and scale the result to the chosen fixed-point range. The values of sin vary from 0.0 to 1.0 within the first quadrant. If we multiply the value 1.0 by 32768.0 and convert to short, the result overflows, since the maximum value of a signed 16-bit short is 32767. A solution is to negate the sin values in the table (−32768 does fit) and negate them back after the value is read from the table.
Here is an example function for extracting values of sin. Note that the return value is sin scaled by 32768.0; the quadrant selection below follows the symmetry properties described above (the full circle is 4096 angle units):

INLINE int fixed_sin( int angle )
{
    int phase  = angle & (1024 + 2048);   /* which 90-degree quadrant */
    int subang = angle & 1023;            /* offset within the quadrant */
    if ( phase == 0 )         return -(int)sintable[ subang ];
    else if ( phase == 1024 ) return -(int)sintable[ 1023 - subang ];
    else if ( phase == 2048 ) return  (int)sintable[ subang ];
    else                      return  (int)sintable[ 1023 - subang ];
}
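Since cos(θ) = sin(θ + 90°), a corresponding cosine function (name illustrative) is just a phase-shifted lookup:

INLINE int fixed_cos( int angle )
{
    return fixed_sin( angle + 1024 );   /* 1024 angle units = 90 degrees */
}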
A.2 FIXED-POINT METHODS IN ASSEMBLY LANGUAGE
Typically all processors have instructions that are helpful for fixed-point computations. For example, most processors support multiplication of two 32-bit values into a 64-bit result. However, it may be difficult for the compiler to find the optimal instruction sequence for the C code; direct assembly code is sometimes the only way to achieve good performance. Depending on the compiler and the processor, improvements of more than 2× can often be achieved with optimized assembly code.
Let us take the fixed-point multiplication covered earlier as an example. If you multiply two 32-bit integers, the result will also be a 32-bit integer, which may overflow before you have a chance to shift the result back into a safe range. Even if the target processor supports the wider multiplication, it may be impossible to get the compiler to generate such assembly instructions. To be safe, you have to promote at least one of the arguments to a 64-bit integer. There are two solutions to this dilemma. The first (easy) solution is to use a good optimizing compiler that detects the casts around the operands, and then performs a narrower and faster multiplication. You might even be able to study the machine code sequences that the compiler produces to learn how to express operations so that they lead to efficient machine code. The second solution is to use inlined assembly and explicitly use the narrowest multiply that you can get away with.
Here we show an example of how to do fixed-point operations using ARM assembly. The ARM processor is a RISC-type processor with sixteen 32-bit registers (r0-r15), of which r15 is reserved as the program counter (PC), r13 as the stack pointer (SP), and r14 is typically used as the link register (LR); the rest are available for arbitrary use.
All ARM opcodes can be prefixed with a conditional check based on which the operation is either executed or ignored. All data opcodes have three-register forms where a constant shift operation can be applied to the rightmost register operand at no performance cost. For example, C code along these lines

int INLINE foo( int a, int b )
{
    int t = a + (b >> 16);
    return (a << 7) - t;    /* second statement chosen for illustration */
}

executes in just two cycles when converted to ARM:

    ADD   r2, r0, r1, ASR #16   ; t = a + (b >> 16), the shift comes for free
    RSB   r0, r2, r0, LSL #7    ; (reverse subtract) r0 = (a << 7) - t

For more details about ARM assembler, see www.arm.com/documentation. Note that the following examples are not optimized for any particular ARM implementation. The pipelining rules for different ARM variants, as well as different implementations of each variant, can be different.
The following example code multiplies a u0.32 fixed-point number with another u0.32 fixed-point number and stores the resulting high 32 bits in register r0:

; assuming:
;   r2 = input value 0
;   r3 = input value 1
;   result is directly in r0 register, low bits in r1
    UMULL r1, r0, r2, r3    ; r0:r1 = r2 * r3
In the example above there is no need to actually shift the result by 32, as we can directly store the high bits of the result in the correct register. To fully utilize this increased control over operations and intermediate result ranges, you should combine primitive operations (add, sub, mul) into larger blocks. The following example shows how to compute a vec4 dot product between a normalized vector and a vertex or a normal vector represented in 16.16 fixed point. We want to make the code run as fast as possible and we have selected the fixed-point ranges accordingly. In the example we have chosen the range of the normalized vector of the transformation matrix to be 0.30, as we are going to accumulate the results of four multiplications together, and we need 2 bits of extra room for the accumulation:
; input:
;   r1-r4 = vec4 (assumed to be same over N input vectors) X,Y,Z,W, in 0.30
;   r5    = pointer to the 16.16 vertex data (register assignment illustrative)
;
; in the code:
;   64-bit output is in r8:r7,
;   we take the high 32 bits (r8 register) directly
;
; (an illustrative instruction sequence:)
    LDMIA r5!, {r9-r12}      ; load vertex X,Y,Z,W (16.16)
    SMULL r7, r8, r1, r9     ; r8:r7  = X * vx
    SMLAL r7, r8, r2, r10    ; r8:r7 += Y * vy
    SMLAL r7, r8, r3, r11    ; r8:r7 += Z * vz
    SMLAL r7, r8, r4, r12    ; r8:r7 += W * vw
As we implemented the whole operation as one vec4 · vec4 dot product instead of a collection of primitive fixed-point operations, we avoided intermediate shifts and thus improved the accuracy of the result. By using the 0.30 fixed-point format we reduced the accuracy of the input vector by 2 bits, but usually the effect is negligible: remember that even IEEE floats have only 24 significant bits. With careful selection of ranges, we avoided overflows altogether and eliminated a 64-bit shift operation that would have required several cycles. By using ARM-specific multiply-and-accumulate instructions that operate directly on 64 bits, we avoided doing 64-bit accumulations that usually require two assembly opcodes: ADD and ADC (add with carry).
In the previous example the multiplication was done in fixed point. If the input values, e.g., vertex positions, are small, some accuracy is lost in the final output because of the fixed position of the decimal point. For more accuracy, the exponents should be tracked as well. In the following example the input matrix is stored in a format where each matrix column has a common exponent and the scalar parts are normalized to that exponent. The code shows how one row is multiplied. Note that this particular variant assumes the availability of the ARMv5 instruction CLZ and will thus not run on ARMv4 devices.
; input:
;   r2-r6 = X,Y,Z,W,E (exponent)
;
; Code below does not do tight normalization (e.g., if
; we have the number 0x00000000 00000001, we don't return
; 0x40000000, but we subtract 32 from the exponent and return
; 0x00000001). This is because we only do highest-bit
; counting in the high 32 bits of the result. No accuracy
; is lost due to this at this stage.
;
; If tight normalization is required, it can be added with
; extra comparisons.
;
; The following opcode (eor) calculates the rough abs(r9)
; value. Positive values stay the same, but negative
; values are bit-inverted -> outcome of ~abs(-1) = 0 etc.
; This is enough for our range calculation. Note that we
; use an arithmetic shift that extends the sign bits.
;
; note2: an ARM register shift by zero returns the original value
;
; output in r9 (scalar) and r6 (exponent)
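A minimal sketch of the normalization step the comments above describe, assuming the dot product has already been accumulated into r8:r7 as in the previous example (register assignments are illustrative):

    EOR   r9, r8, r8, ASR #31   ; rough abs: invert negative values
    CLZ   r9, r9                ; count leading zeros of the high word
    SUB   r9, r9, #1            ; keep one bit for the sign
    RSB   r10, r9, #32          ; r10 = 32 - shift amount
    MOV   r8, r8, LSL r9        ; shift the high word left
    ORR   r8, r8, r7, LSR r10   ; pull in bits from the low word
    SUB   r6, r6, r9            ; adjust the exponent accordingly
    MOV   r9, r8                ; output scalar in r9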