3D Graphics with OpenGL ES and M3G - P42

10 270 0
Tài liệu đã được kiểm tra trùng lặp

Đang tải... (xem toàn văn)

THÔNG TIN TÀI LIỆU

Thông tin cơ bản

Tiêu đề 3D Graphics with OpenGL ES and M3G
Trường học Standard University
Chuyên ngành Computer Science
Thể loại Bài luận
Năm xuất bản 2023
Thành phố City Name
Định dạng
Số trang 10
Dung lượng 144,51 KB

Các công cụ chuyển đổi và chỉnh sửa cho tài liệu này

Nội dung

Values close to zero have a very high accuracy: two consecutive floats at around 1.0 have a preci-sion of1/16777216, floats at around 250.0 have roughly the same precision as fixed-point nu

Trang 1

does not require this large a range, as only magnitudes up to 2^32 need to be representable, and for colors even 2^10 is enough. The precision of these fixed-point numbers is constant (1/65536), whereas the precision of floats depends on the magnitude of the values. Values close to zero have a very high accuracy: two consecutive floats at around 1.0 have a precision of 1/16777216, floats at around 250.0 have roughly the same precision as fixed-point numbers, while larger numbers become more inaccurate (two consecutive floats around 17 million are more than 1.0 units apart). OpenGL requires only an accuracy of one part in 10^5, which is a little under 17 bits; single-precision floats have 24 bits of accuracy.
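These spacings can be verified with a small standalone C program (ours, not from the original text) using the standard nextafterf function:

#include <math.h>
#include <stdio.h>

int main( void )
{
    /* Print the gap between consecutive floats at various magnitudes. */
    printf( "gap below 1.0:   %g\n", 1.0f - nextafterf( 1.0f, 0.0f ) );      /* 1/16777216 */
    printf( "gap below 250:   %g\n", 250.0f - nextafterf( 250.0f, 0.0f ) );  /* ~1/65536 */
    printf( "gap around 17e6: %g\n", nextafterf( 17e6f, 18e6f ) - 17e6f );   /* 2.0 */
    printf( "16.16 step:      %g\n", 1.0f / 65536.0f );
    return 0;
}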

Below are C macros for converting from float to fixed and vice versa:

#define float_to_fixed( a ) (int)((a) * (1<<16))   /* reconstructed; the original listing showed both directions */
#define fixed_to_float( a ) (((float)a) / (1<<16))

These are “quick-and-dirty” versions of conversion. float_to_fixed can overflow if the magnitude of the float value is too great, or underflow if it is too small. float_to_fixed can also be made slightly more accurate by rounding. For example, asymmetric arithmetic rounding works by adding 0.5 to the number before truncating it to an integer, e.g., (int)floor((a) * 65536.0f + 0.5f).

Finally, note that some of these conversions are expensive on some processors, and thus should not be used in performance-critical code such as inner loops; where possible, hoist them out of the loop, as in the sketch below.
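As an illustration (our sketch, not from the original text; it assumes the int64 type and the float_to_fixed macro above), the float factor is converted once outside the loop instead of converting every element:

void scale_fixed_array( int *v, int n, float factor )
{
    int f = float_to_fixed( factor );   /* one conversion, outside the loop */
    int i;
    for( i = 0; i < n; ++i )
        v[i] = (int)( ((int64)v[i] * f) >> 16 );   /* 16.16 * 16.16 -> 16.16 */
}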

Here are some 16.16 fixed-point numbers, expressed in hexadecimal, and the corresponding decimal numbers (representative examples; the original table was not preserved):

0x00010000 = 1.0
0x00008000 = 0.5
0x00024000 = 2.25
0x00000001 = 1/65536 (the smallest positive step)
0xFFFF0000 = -1.0

Depending on the situation it may make sense to move the decimal point to some other location, although 16.16 is a good general choice. For example, if you are only interested in numbers between zero and one (but excluding one), you should move the decimal point all the way to the left; if you use 32 bits, denote that with u0.32 (here u stands for unsigned). In rasterization, the number of sub-pixel bits and the size of the screen in pixels determine the number of bits you should have on the right side of the decimal point. Signed 16.16 is a compromise that is relatively easy to use, and gives the same relative importance to numbers between zero and one as to values above one.


In the upcoming examples we also use other fixed-point formats. For example, a 32.32 fixed-point value would be stored using 64 bits, and it could be converted to a float by dividing it by 2^32, whereas 32.16 would take 48 bits and have 32 integer and 16 decimal bits, and 32.0 would denote a regular 32-bit signed integer. To distinguish between unsigned (such as u0.32) and signed two's complement fixed-point formats we prepend unsigned formats with u.
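For instance, conversion macros for these wider formats could look as follows (our sketch; the names are not from the text, and the intermediate math is done in double so that the 2^32 divisor stays exact):

#define fixed32_32_to_float( a ) (float)((double)(a) / 4294967296.0)               /* divide by 2^32 */
#define u0_32_to_float( a )      (float)((double)(unsigned int)(a) / 4294967296.0)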

In this appendix, we first go through fixed-point processing in C. We then show what you can do using assembly language, and conclude with a section on fixed-point programming in Java.

A.1 FIXED-POINT METHODS IN C

In this section we first discuss the basic fixed-point operations, followed by the shared exponent approach for vector operations, and conclude with an example that precalculates trigonometric functions into a table.

A.1.1 BASIC OPERATIONS

The addition of two fixed-point numbers is usually very straightforward (and subtraction is just a signed add):

#define add_fixed_fixed( a, b ) ((a)+(b))

We have to watch out, though; the operation may overflow. As opposed to floats, the overflow is totally silent; there is no warning about the result being wrong. Therefore, you should always insert debugging code into your fixed-point math, the main idea being that the results before and after clamping from 64-bit integers to 32-bit integers have to agree.[1] Here is an example of how that can be done:

#if defined(DEBUG)
int add_fixed_fixed_chk( int a, int b )
{
    int64 bigresult   = ((int64)a) + ((int64)b);
    int   smallresult = a + b;
    assert( smallresult == bigresult );
    return smallresult;
}
#endif

In release builds the check is compiled out; a dispatch along these lines (a reconstruction, as the original listing was not preserved) selects between the two:

#if defined(DEBUG)
#   define ADD_FIXED( a, b ) add_fixed_fixed_chk( (a), (b) )   /* checked version */
#else
#   define ADD_FIXED( a, b ) add_fixed_fixed( (a), (b) )       /* plain macro */
#endif

[1] Code examples are not directly portable; minimally you have to select the correct platform 64-bit type. Examples: typedef long long int64; (GCC) or typedef __int64 int64; (Microsoft compilers).


Another point to note is that these fixed-point routines should always be macros or inlined functions, not called through regular functions. The function calling overhead would take away most of the speed benefits of fixed-point programming. For the debug versions, using regular functions is fine, though.

Multiplications are more complicated than additions. Let us analyze the case of multiplying two 16.16 numbers and storing the result into another 16.16 number. When we multiply two 16.16 numbers, the accurate result is a 32.32 number. We ignore the last 16 bits of the result simply by shifting right 16 steps, yielding a 32.16 number. If all the remaining bits are zero, either one or both of the operands were zero, or we underflowed, i.e., the magnitude of the result was too small to be represented in a 16.16 fixed-point number. Similarly, if the result is too large to fit in 16.16, we overflow. But if the result is representable as a 16.16 number, we can simply take the lowest 32 bits. Note that the intermediate result must be stored in a 64-bit integer, unless the magnitude of the result is known to be under 1.0 before multiplication. We are finally ready to define multiplication:

#define mul_fixed_fixed( a, b ) (int)(((int64)(a)*(int64)(b)) >> 16)

If one of the multiplicands is an int, then the inputs are 16.16 and 32.0, the result is 48.16, and we can omit the shift operation:

#define mul_fixed_int( a, b ) ((a)*(b))   /* reconstructed; keeps the low 32 bits of the 48.16 result */

Multiplications overflow even more easily than additions. The following example shows how you can check for overflows in debug builds:

#if defined(DEBUG)
int mul_fixed_fixed_chk( int a, int b )
{
    int64 bigresult = (((int64)a) * ((int64)b)) >> 16;
    /* high bits must be just sign bits (0's or 1's) */
    int64 sign = bigresult >> 31;   /* reconstructed: this declaration was missing */
    assert( (sign == 0) || (sign == -1) );
    return (int)bigresult;
}
#endif

Note also that multiplications by a power of two are typically faster when done with shifts instead of a normal multiplication. For example:

assert((a << 4) == (a * 16));

Let us then see how division works. Dividing two 16.16 numbers gives you an integer and loses precision in the process. However, as we want the result to be 16.16, we should shift the numerator left 16 steps and store it in an int64 before the division. This also avoids losing the fractional bits. Here are several versions of the division with different arguments (fixed or int), producing a 16.16 result:

#define div_fixed_fixed( a, b ) (int)( (((int64)(a))<<16) / (b) )
#define div_fixed_int( a, b )   ((a) / (b))                         /* reconstructed variant: 16.16 / 32.0 */
#define div_int_fixed( a, b )   (int)( (((int64)(a))<<32) / (b) )   /* reconstructed variant: 32.0 / 16.16 */

These simple versions do not check for overflows, nor do they trap the case b = 0. Division, however, is usually a much slower operation than multiplication. If the interval of the operands is small enough, it may be possible to precalculate a table of reciprocals and perform a multiplication instead. With a wider interval one can build a sparse table of reciprocals and interpolate between the nearest results, as sketched below.
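The following sketch (ours, not from the original text) precalculates a sparse reciprocal table for 16.16 inputs in [1.0, 2.0) and interpolates linearly between neighboring entries:

static int recip_table[257];   /* reciprocals in 16.16, sampled every 1/256 */

void init_recip_table( void )
{
    int i;
    for( i = 0; i <= 256; ++i )
        recip_table[i] = (int)( 65536.0 / (1.0 + i / 256.0) + 0.5 );
}

/* a is 16.16 in [1.0, 2.0); returns approximately 1/a in 16.16 */
int recip_fixed( int a )
{
    int idx  = (a - 65536) >> 8;    /* table segment */
    int frac = (a - 65536) & 255;   /* position within the segment */
    int lo   = recip_table[idx];
    int hi   = recip_table[idx + 1];
    return lo + (((hi - lo) * frac) >> 8);
}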

For slightly more precision, we can incorporate rounding into the fixed-point operations. Rounding works much the same way as when converting a float to a fixed-point number: add 0.5 before truncating to an integer. Since we use integer division in the operations, we just have to add 0.5 before the division. For multiplication this is easy and fairly cheap: since our divisor is the fixed value 1 << 16, we add one half of that, 1 << 15, before the shift:

#define mul_fixed_fixed_round( a, b ) \
    (int)( ((int64)(a) * (int64)(b) + (1<<15)) >> 16 )

Similarly, for correct rounding in the division of a by b, we should add b/2 to a before dividing by b; for example:
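A sketch of such a rounded division (our macro name; it assumes positive operands, as a negative numerator would need the offset negated):

#define div_fixed_fixed_round( a, b ) \
    (int)( ( (((int64)(a)) << 16) + ((b) >> 1) ) / (b) )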

A.1.2 SHARED EXPONENTS

Sometimes the range that is required for calculations is too great to fit into 32-bit registers. In some of those cases you can still avoid the use of full floating point. For example, you can create your own floating-point operations that do not deal with the trickiest parts of the IEEE standard, e.g., the handling of infinities, NaNs (Not-a-Numbers), or floating-point exceptions.

However, with vector operations, which are often needed in 3D graphics, another possibility is to store the meaningful bits, the mantissas, separately into integers, perform integer calculations using them, and share the exponent across all terms. For example, if you need to calculate a dot product of a floating-point vector with a vector of integer or fixed-point numbers, you could normalize the floating-point vector to a common base exponent, perform the multiplications and additions in fixed point, and finally, if needed, adjust the base exponent depending on the result. Another name for this practice of shared exponents is block floating point. A sketch of such a dot product follows below.
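Here is a minimal sketch (ours, under the stated assumptions; it uses the standard frexpf/ldexpf functions and the int64 type from earlier) of a shared-exponent dot product between a float vec4 and a 16.16 vec4:

#include <math.h>

/* Returns the integer accumulator; the true dot product in 16.16 is
   the return value scaled by 2^(*exp_out). */
int64 dot4_block_float( const float v[4], const int fx[4], int *exp_out )
{
    float maxmag = 0.0f;
    int   i, e = 0, mant[4];
    int64 acc = 0;
    for( i = 0; i < 4; ++i )                  /* pick the shared exponent */
        if( fabsf( v[i] ) > maxmag ) maxmag = fabsf( v[i] );
    frexpf( maxmag, &e );                     /* maxmag = m * 2^e, 0.5 <= m < 1 */
    for( i = 0; i < 4; ++i )                  /* mantissas in 0.30 fixed point */
        mant[i] = (int)( v[i] * ldexpf( 1.0f, 30 - e ) );
    for( i = 0; i < 4; ++i )                  /* pure integer multiply-accumulate */
        acc += (int64)mant[i] * fx[i];
    *exp_out = e - 30;
    return acc;
}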

Using a shared exponent may lead to underflow, truncating some of the terms to zero. In some cases such truncation may lead to a large error. Here is a somewhat contrived example of a worst-case error: [1.0e40, 1.0e8, 1.0e8, 1.0e8] · [0, 32768, 32768, 32768]. With a shared exponent the first vector becomes [1, 0, 0, 0] * 1e40, which, when dotted with the second vector, produces a result that is very different from the true answer.

The resulting number sequence, mantissas together with the shared exponent, is really a vectorized floating-point number and needs to be treated as such in the subsequent calculations, up to the point where the exponent can finally be eliminated. It may seem that since the exponent must be normalized in the end in any case, we are not saving much. Keep in mind, though, that the most expensive operations are only performed once for the full dot product. It may even be possible that the required multiplication and addition operations can be done with efficient multiply-and-accumulate (MAC) operations in assembly if the processor supports such operations.

Conversion from floating-point vectors into vectorized floating point is only useful in situations where the cost of conversion can be amortized somehow. For example, if you run 50 dot products where the floating-point vector stays the same and the fixed-point vectors vary, this method can save a lot of computation. An example where you might need this kind of functionality is in a physics library. A software implementation of vertex array transformation by the modelview and projection matrices is another example where this approach could be attempted: multiplication of a homogeneous vertex by a 4 × 4 matrix can be done with four dot products, as sketched below.
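A sketch (ours) of that vertex loop, reusing the dot4_block_float helper from above; note that a real implementation would normalize each matrix row once, outside the vertex loop, so that the float-to-integer conversion cost is actually amortized:

void transform_vertices( const float m[4][4], const int (*verts)[4],
                         int64 (*out)[4], int (*out_exp)[4], int n )
{
    int i, row;
    for( i = 0; i < n; ++i )
        for( row = 0; row < 4; ++row )
            out[i][row] = dot4_block_float( m[row], verts[i], &out_exp[i][row] );
}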

Many processors support operations that can be used for normalizing the result. For example, ARM processors with the ARMv5 instruction set or later support the CLZ instruction that counts the number of leading zero bits in an integer. Even when the processor supports these operations, they are typically only accessible either as compiler-specific intrinsic functions or through inline assembly. For example, a portable version of count-leading-zeros can be implemented as follows:

/* Table stores the CLZ value for a byte; the entries continue
   8, 7, 6, 6, 5, 5, 5, 5, 4, ... (abridged here) */
static unsigned char clz_table[256] = { 8, 7, 6, 6, /* ... */ };

INLINE int clz_unsigned( unsigned int num )
{
    int res = 24;
    if (num >> 16)
    {
        num >>= 16;
        res -= 16;
    }
    if (num > 255)
    {
        num >>= 8;
        res -= 8;
    }
    return clz_table[num] + res;
}

The GCC compiler has a built-in function for CLZ that can be used like this:

INLINE int clz_unsigned( unsigned int num )
{
    return num ? __builtin_clz( num ) : 32;   /* reconstructed body; __builtin_clz(0) is undefined */
}

The built-in gets compiled to the ARM CLZ opcode when compiling for an ARM target. The performance of this routine depends on the processor architecture; for some processors it may be faster to calculate the result with arithmetic instructions instead of table lookups.

In comparison, the ARM assembly variant of the same thing is:

INLINE int clz_unsigned( unsigned int num )
{
    int result;
    asm { CLZ result, num }   /* reconstructed: the original listing showed the CLZ opcode here */
    return result;
}
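Such a routine is what you would use to normalize a shared-exponent result; for example (our sketch):

/* Shift a positive mantissa as far left as possible, keeping one
   headroom bit, and track the shift in the exponent. */
int normalize_positive( unsigned int mant, int *exponent )
{
    int shift = clz_unsigned( mant ) - 1;
    *exponent -= shift;
    return (int)( mant << shift );
}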

A.1.3 TRIGONOMETRIC OPERATIONS

The use of trigonometric functions such as sin, cos, or arctan can be expensive both in the floating-point and fixed-point domains. But since these functions are periodic and symmetric, have a compact range [−1, 1], and can sometimes be expressed in terms of each other (e.g., sin(θ + 90°) = cos(θ)), you can precalculate them directly into tables and store the results in fixed point.

A case in point is sin (and from that cos), for which only a 90° segment needs to be tabulated; the rest can be obtained through the symmetry and continuity properties of sin. Since the table needs to be indexed by an integer, the input parameter needs to be discretized as well. Quantizing 90° to 1024 steps usually gives a good trade-off between accuracy, table size, and ease of manipulation of angle values (since 1024 is a power of two). The following code precalculates such a table:

short sintable[1024];
int   ang;

for( ang = 0; ang < 1024; ang++ )
{
    /* angle_in_radians = ang/1024 * pi/2 */
    double rad_angle = (ang * PI) / (1024.0 * 2.0);
    sintable[ang] = (short)( -sin(rad_angle) * 32768.0 );
}

In the loop we first convert the table index into radians. Using that value we evaluate sin and scale the result to the chosen fixed-point range. The values of sin vary from 0.0 to 1.0 within the first quadrant. If we multiply the value 1.0 of sin by 32768.0 and convert to short, the result overflows. A solution is to negate the sin values in the table and negate them back after the value is read from the table.

Here is an example function for extracting values of sin. Note that the return value is sin scaled by 32768.0:

INLINE int fixed_sin( int angle )
{
    /* Partially reconstructed: a full circle is 4096 steps, so the two
       bits above the 10-bit quadrant offset select the quadrant. Only
       the phase == 1024 branch survives from the original text. */
    int subang = angle & 1023;
    int phase  = angle & (1024 + 2048);
    if ( phase == 0 )         return -(int)sintable[ subang ];
    else if ( phase == 1024 ) return -(int)sintable[ 1023 - subang ];
    else if ( phase == 2048 ) return  (int)sintable[ subang ];
    else                      return  (int)sintable[ 1023 - subang ];
}
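Given the 4096-step circle, cos comes from the same table with a 90° (1024-step) phase shift; this macro is our addition, not from the text:

/* cos(theta) = sin(theta + 90 degrees); 1024 steps = 90 degrees */
#define fixed_cos( angle ) fixed_sin( (angle) + 1024 )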

A.2 FIXED-POINT METHODS IN ASSEMBLY LANGUAGE

Typically all processors have instructions that are helpful for fixed-point computations. For example, most processors support multiplication of two 32-bit values into a 64-bit result. However, it may be difficult for the compiler to find the optimal instruction sequence for the C code; direct assembly code is sometimes the only way to achieve good performance. Depending on the compiler and the processor, improvements of more than 2× can often be achieved using optimized assembly code.

Let us take the fixed-point multiplication covered earlier as an example. If you multiply two 32-bit integers, the result will also be a 32-bit integer, which may overflow before you have a chance to shift the result back into a safe range. Even if the target processor supports an optimized widening multiplication, it may be impossible to get a compiler to generate such assembly instructions; to be safe, you have to promote at least one of the arguments to a 64-bit integer. There are two solutions to this dilemma. The first (easy) solution is to use a good optimizing compiler that detects the casts around the operands and then performs a narrower and faster multiplication. You might even be able to study the machine code sequences that the compiler produces to learn how to express operations so that they lead to efficient machine code. The second solution is to use inlined assembly and explicitly use the narrowest multiply that you can get away with.

Here we show an example of how to do fixed-point operations using ARM assembly. The ARM processor is a RISC-type processor with sixteen 32-bit registers (r0-r15), of which r15 is reserved for the program counter (PC), r13 for the stack pointer (SP), and r14 is typically used as the link register (LR); the rest are available for arbitrary use.

All ARM opcodes can be prefixed with a condition, based on which the operation is either executed or ignored. All data opcodes have three-register forms, where a constant shift operation can be applied to the rightmost register operand at no performance cost. For example, the following C code

int INLINE foo( int a, int b )
{
    int t = a + (b >> 16);
    if( t < 0 ) t = -t;   /* reconstructed: the end of the original listing was lost */
    return t;
}

executes in just two cycles when converted to ARM (a plausible reconstruction of the lost listing):

ADDS  r0, r0, r1, ASR #16   ; t = a + (b >> 16), setting the flags
RSBMI r0, r0, #0            ; if negative, t = 0 - t (reverse subtract)

For more details about ARM assembly, see www.arm.com/documentation. Note that the following examples are not optimized for any particular ARM implementation; the pipelining rules for different ARM variants, as well as different implementations of each variant, can differ.

The following example code multiplies a u0.32 fixed-point number by another u0.32 fixed-point number and stores the resulting high 32 bits in register r0:

; assuming:
; r2 = input value 0
; r3 = input value 1
; result is directly in r0 register, low bits in r1
UMULL r1, r0, r2, r3   ; reconstructed: unsigned 64-bit multiply, r0:r1 = r2 * r3

In the example above there is no need to actually shift the result right by 32, as we can store the high bits of the result directly in the correct register. To fully utilize this increased control over operations and intermediate result ranges, you should combine primitive operations (add, sub, mul) into larger blocks. The following example shows how to compute a dot product between a normalized vec4 and a vertex or a normal vector represented in 16.16 fixed point.


We want to make the code run as fast as possible, and we have selected the fixed-point ranges accordingly. In this example we have chosen the range of the normalized vector of the transformation matrix to be 0.30, as we are going to accumulate the results of four multiplications, and we need 2 bits of extra room for the accumulation:

; input:
; r1-r4 = vec4 (assumed to be same over N input vectors) X,Y,Z,W
;
; in the code:
; 64-bit output is in r8:r7,
; we take the high 32 bits (r8 register) directly
;
; reconstructed listing; the vertex components are assumed
; to be in r9-r12 as 16.16 values
SMULL r7, r8, r1, r9    ; r8:r7  = X * vx
SMLAL r7, r8, r2, r10   ; r8:r7 += Y * vy
SMLAL r7, r8, r3, r11   ; r8:r7 += Z * vz
SMLAL r7, r8, r4, r12   ; r8:r7 += W * vw

As we implemented the whole operation as one vec4 · vec4 dot product instead of a collection of primitive fixed-point operations, we avoided intermediate shifts and thus improved the accuracy of the result. By using the 0.30 fixed-point format we reduced the accuracy of the input vector by 2 bits, but usually the effect is negligible: remember that even IEEE floats have only 24 significant bits. With careful selection of ranges, we avoided overflows altogether and eliminated a 64-bit shift operation, which would require several cycles. By using ARM-specific multiply-and-accumulate instructions that operate directly on 64 bits, we avoided doing 64-bit accumulations that usually require two assembly opcodes: ADD and ADC (add with carry).
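For comparison, here is a C version of the same 0.30 × 16.16 dot product (our sketch; whether a compiler maps it to SMULL/SMLAL varies):

int dot4_fixed( const int m[4], const int v[4] )
{
    /* each product is 2.46; the sum of four fits in 64 bits */
    int64 acc = (int64)m[0] * v[0] + (int64)m[1] * v[1]
              + (int64)m[2] * v[2] + (int64)m[3] * v[3];
    return (int)( acc >> 32 );   /* take the high 32 bits, as in the assembly */
}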

In the previous example the multiplication was done in fixed point. If the input values, e.g., vertex positions, are small, some accuracy is lost in the final output because of the fixed position of the decimal point. For more accuracy, the exponents should be tracked as well. In the following example the input matrix is stored in a format where each matrix column has a common exponent and the scalar parts are normalized to that exponent. The code shows how one row is multiplied. Note that this particular variant assumes the availability of the ARMv5 instruction CLZ, and will thus not run on ARMv4 devices.

; input:
; r2 - r6 = X,Y,Z,W,E (exponent)
;
; Code below does not do tight normalization (e.g., if
; we have the number 0x00000000 00000001, we don't return
; 0x40000000, but we subtract the exponent by 32 and return
; 0x00000001). This is because we only do highest-bit
; counting in the high 32 bits of the result. No accuracy
; is lost due to this at this stage.
;
; If tight normalization is required, it can be added with
; extra comparisons.
;
; The opcodes below are reconstructed, as the original listing
; was not preserved; r9 is assumed to hold the high 32 bits of
; the dot-product result, and r6 the common exponent.
;
; The following opcode (eor) calculates the rough abs(r9)
; value. Positive values stay the same, but negative
; values are bit-inverted --> outcome of ~abs(-1) = 0 etc.
; This is enough for our range calculation. Note that we
; use an arithmetic shift that extends the sign bits.
EOR  r10, r9, r9, ASR #31   ; rough absolute value
CLZ  r10, r10               ; count leading zeros (ARMv5)
SUB  r10, r10, #1           ; keep one bit for the sign
SUB  r6, r6, r10            ; adjust the shared exponent
; note2: ARM register shift with zero returns the original value
MOV  r9, r9, LSL r10        ; normalize the mantissa
; output in r9 (scalar) and r6 (exponent)
