In most modern operating systems, the memory used by a program (or process) is known as virtual memory. The memory has no fixed physical address and can be moved between locations and even swapped to disk (through page invalidation). This latter action is typically known as swap memory, as it allows users to emulate having more physical memory than they really do.
The downside to swap memory, however, is that the process memory could contain sensitive information such as private keys, usernames, passwords, and other credentials. To prevent this, an application can lock memory. In operating systems such as those based on the NT kernel (e.g., Win2K, WinXP), locking is entirely voluntary and the OS can choose to later swap nonkernel data out.
In POSIX-compatible operating systems, such as those based on the Linux and BSD kernels, a set of functions such as mlock(), munlock(), mlockall(), and so forth have been provided to facilitate locking. Physical memory in most systems can be costly, so the polite and proper application will request to lock as little memory as possible. In most cases, locked memory will span a region that contains pages of memory. On the x86 series of processors, a page is four kilobytes. This means that all locked memory will actually lock a multiple of four kilobytes.
Ideally, an application will pool its related credentials to reduce the number of physical pages required to lock them in memory.
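As a brief illustration of the calls just mentioned, the following sketch locks a single page-sized pool of credentials with mlock() and wipes it before unlocking. The buffer name and pooling strategy are illustrative assumptions, not part of the original text, and the alignment attribute is GCC-specific.

#include <string.h>
#include <sys/mman.h>

/* Hypothetical credential pool: related secrets are grouped into one
   page-aligned buffer so a single mlock() call covers them all. */
static unsigned char secret_pool[4096] __attribute__((aligned(4096)));

int protect_secrets(void)
{
   /* Ask the OS to keep this page resident so it is never swapped to disk. */
   if (mlock(secret_pool, sizeof(secret_pool)) != 0) {
      return -1;  /* e.g., insufficient privilege or RLIMIT_MEMLOCK exceeded */
   }
   return 0;
}

void release_secrets(void)
{
   /* Wipe before unlocking so the page never reaches swap with secrets in it. */
   memset(secret_pool, 0, sizeof(secret_pool));
   munlock(secret_pool, sizeof(secret_pool));
}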
The processing of the plaintext is nicely bundled up in a single function call, making its deployment rather trivial.
Putting It All Together
This chapter introduced the two standard encrypt and authenticate modes as specified by both NIST and IEEE. They are both designed to take a single key and IV (nonce) and produce a ciphertext and message authentication code tag, thereby simplifying the process for developers by reducing the number of different standards they must support and, in practice, the number of functions they have to call to accomplish the same results.
Knowing how to use these modes is a matter of properly choosing an IV, making ideal use of the additional authentication data (AAD), and checking the MAC tag they produce. Neither of these two modes will manage any of these properties for the developer, so they must be looked after carefully.
For most applications, it is highly advisable to use these modes over an ad hoc combination of encryption and authentication, if not solely for the reason of code simplicity, then also for proper adherence to cryptographic standards.
What Are These Modes For?
We saw in the previous chapter how we could accomplish both privacy and authentication of data through the combined use of a symmetric cipher and chaining mode with a MAC algorithm. Here, the goal of these modes is to combine the two. This accomplishes several key goals simultaneously. As we alluded to in the previous chapter, CCM and GCM are also meant for small packet messages, ideal for securing a stream of messages between parties. CCM and GCM can be used for offline tasks such as file encryption, but they are not meant for such tasks (especially CCM, since it needs the length of the plaintext in advance).
First, combining the modes makes development simpler: there is only one key and one IV to keep track of. The mode will handle using both for both tasks. This makes key derivation easier and quicker, as less session data must be derived. It also means there are fewer variables floating around to keep track of.
These combined modes also mean it is possible to perform both goals with a single function call. In code where we specifically must trap error codes (usually by looking at the return codes), having fewer functions to call means the code is easier to write safely. While there are other ways to trap errors, such as signals and internal masking, making thread-safe global error detection in C is rather difficult.
In addition to making the code easier to read and write, combined modes make the security easier to analyze. CCM, for instance, is a combination of CBC-MAC and CTR encryption mode. In various ways, we can reduce the security of CCM to the security of these modes. In general, with a full-length MAC tag, the security of CCM reduces to the security of the block cipher (assuming a unique nonce and random key are used).
What we mean by reduce is that we can make an argument for equivalence. For example, if the security of CTR is reducible to the security of the cipher, we are saying it is as secure as the latter. By this reasoning, if one could break the cipher, he could also break CTR mode. (Strictly speaking, the security of CTR reduces to the determination of whether the cipher is a PRP.)
So, in this context, if we say CCM reduces to the security of the cipher in terms of being a proper pseudo-random permutation (PRP), then if we can break the cipher (by showing it is not a PRP), we can likely break CCM. Similarly, GCM reduces to the security of the cipher for privacy and to universal hashing for the MAC. It is more complicated to prove that it can be secure.
Choosing a Nonce
Both CCM and GCM require a unique nonce (N used once) value to maintain their privacy and authenticity goals. In both cases, the value need not be random, but merely unique for a given key. That is, you can safely use the same nonce (only once, though) between two different keys. Once you use the nonce for a particular key, you cannot use it again.
GCM Nonces
GCM was designed to be most efficient with 12-byte nonce values. Any longer or shorter, and GHASH is used to create an IV for the mode. In this case, we can simply use the 12-byte nonce as a packet counter. Since we have to send the nonce to the other party anyway, this means we can double up on the purpose of this field. Each packet would get its own 12-byte nonce (by incrementing it), and the receiver can check for replays and out-of-order packets by checking the nonce as if it were a 96-bit number.
You can use the 12-byte number as either a big or little endian value, as GCM will not truncate the nonce.
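A minimal sketch of the packet-counter idea, assuming the 12-byte nonce is treated as a big-endian 96-bit counter (the endianness choice is the protocol designer's, as noted above):

/* Increment a 12-byte GCM nonce interpreted as a big-endian 96-bit counter. */
static void nonce_increment(unsigned char nonce[12])
{
   int i;
   for (i = 11; i >= 0; i--) {
      if (++nonce[i] != 0) {
         break;   /* stop once a byte does not wrap to zero (no carry) */
      }
   }
}

The receiver can then compare the received nonce against the last one it accepted to detect replays and out-of-order packets.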
Additional Authentication Data
Both CCM and GCM support a sort of side channel known as additional authentication data (AAD). This data is meant to be nonprivate data that should influence the MAC tag output. That is, if the plaintext and AAD are not present together and unmodified, the tag should reflect that.
The usual use for AAD is to store session metadata along with the packet. Things such as username, session ID, and transaction ID are common. You would never use a user credential, since it would not really be something you need on a per-packet basis.
Both protocols support empty AAD strings. Only GCM is optimized to handle AAD strings that are a multiple of 16 bytes long. CCM inserts a four- or six-byte header that offsets the data and makes it harder to optimize for. In general, if you are using CCM, try to have very short AAD strings, preferably less than 10 bytes, as you can cram that into a single encrypt invocation. For GCM, try to have your AAD strings be a multiple of 16 bytes, even if you have to pad with zero bytes (the implementation cannot do this padding for you, as it would change the meaning of the AAD).
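A small sketch of that zero-padding step, with the caveat (an assumption of this example, not a library rule) that both sender and receiver must agree to treat the padded buffer as the actual AAD:

#include <string.h>

/* Round an AAD buffer up to the next multiple of 16 bytes with zero padding.
   Returns the padded length, or 0 if the buffer cannot hold the padding. */
unsigned long pad_aad(unsigned char *aad, unsigned long aadlen, unsigned long maxlen)
{
   unsigned long padded = (aadlen + 15UL) & ~15UL;
   if (padded > maxlen) {
      return 0;
   }
   memset(aad + aadlen, 0, padded - aadlen);
   return padded;
}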
MAC Tag Data
The MAC tag produced by both implementations is not checked internally. The typical usage would involve transmitting the MAC tag with the ciphertext, and the recipient would compare it against the one he generated while decrypting the message.
In theory at least, you can truncate the MAC tag to short lengths such as 80 or 96 bits. However, some evidence points to the contrary with GCM, and in reality the savings are trivial. As the research evolves, it would be best to read up on the latest analysis papers of GCM and CCM to see if short tags are still in fact secure.
In practice, you can save more space if you aggregate packets over a stable channel. For example, instead of sending 128-byte packets, send 196- or 256-byte packets. You will send fewer nonces (and protocol data), and the difference can allow you to use a longer MAC tag. Clearly, this does not work in low-latency switching cases (e.g., VoIP), so it is not a bulletproof suggestion.
Example Construction
For our example, we will borrow from the example of Chapter 6, except instead of using HMAC and CTR mode, we will use CCM. The rest of the demo works the same. Again, we will be using LibTomCrypt, as it provides a nice CCM interface that is very easy to use.
We only have one key, and it is 16 bytes long.
005 /* Our Per Packet Sizes, Nonce len and MAC len */
006 #define NONCELEN 4
007 #define MACLEN 12
008 #define OVERHEAD (NONCELEN+MACLEN)
As in the previous example, we have a packet counter length (the length of our nonce), MAC tag length, and the composite overhead. Here we have a 32-bit counter and a 96-bit MAC tag.
014 /* our nice containers */
039 unsigned char tmp[2*KEYLEN];
040 unsigned long tmplen;
We use our packet counter PktCTR as a method of keeping the packets in order.
We also use it as a nonce for the CCM algorithm. The output generated will first consist of the nonce, followed by the ciphertext, and then the MAC tag.
This function call will generate the ciphertext and MAC tag for us. Libraries are nice.
120 out+inlen+NONCELEN, &taglen, CCM_ENCRYPT))
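Only the tail of that call survives in the listing, so for orientation here is a hedged sketch of what the complete encrypt-side call might look like. It follows the ccm_memory() prototype of current LibTomCrypt releases (older releases differ slightly, for example in the prescheduled-key parameter); key and KEYLEN stand in for the secret defined in the omitted part of the listing, and the whole routine is an illustration, not the book's code.

#include <tomcrypt.h>

/* Hedged sketch only: encrypt inlen bytes from in into out as
   nonce || ciphertext || tag, mirroring the book's encode_frame(). */
static int encode_frame_sketch(const unsigned char *key,
                               const unsigned char *in, unsigned long inlen,
                               unsigned char *out) /* OVERHEAD+inlen bytes, nonce already placed */
{
   unsigned long taglen = MACLEN;

   if (ccm_memory(find_cipher("aes"),          /* assumes AES was registered */
                  key, KEYLEN,                 /* shared secret key          */
                  NULL,                        /* no prescheduled key        */
                  out, NONCELEN,               /* nonce travels with packet  */
                  NULL, 0,                     /* no AAD in this sketch      */
                  (unsigned char *)in, inlen,  /* plaintext always goes here */
                  out + NONCELEN,              /* ciphertext follows nonce   */
                  out + NONCELEN + inlen,      /* MAC tag goes last          */
                  &taglen,
                  CCM_ENCRYPT) != CRYPT_OK) {
      return -1;
   }
   return 0;
}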
These are the plaintext and ciphertext buffers. The plaintext is always specified first regardless of the direction we are using. It may get a bit confusing for those expecting an “input” and “output” order of arguments.
127 int decode_frame(const unsigned char *in,
133 unsigned char tag[MACLEN];
134 unsigned long taglen;
137 if (inlen < MACLEN+NONCELEN) { return -1; }
138 inlen -= MACLEN+NONCELEN;
As before, we assume that inlen is the length of the entire packet and not just the plaintext. We first make sure it is a valid length, and then subtract the MAC tag and nonce lengths from it.
For both encryption and decryption, the ccm_memory() function is used. In decrypt mode, it is used much the same, except our out buffer goes in the plaintext argument position. Therefore, it appears before the input, which seems a bit awkward.
Note that we store the MAC tag locally, since we need to compare it next.
165 memcpy(stream->channels[1].PktCTR, in, NONCELEN);
At this point, our packet is valid, so we copy the nonce out and store it locally, thus preventing future replay attacks.
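The two receiver-side checks implied here, comparing the locally generated tag and rejecting stale nonces, might look like the following sketch. The helper names are illustrative, and the constant-time comparison is a precaution the original listing does not spell out.

#include <string.h>

/* Compare two MAC tags without an early exit, to avoid a timing leak. */
static int tag_matches(const unsigned char *a, const unsigned char *b,
                       unsigned long len)
{
   unsigned char diff = 0;
   unsigned long i;
   for (i = 0; i < len; i++) {
      diff |= a[i] ^ b[i];
   }
   return diff == 0;
}

/* Accept a nonce only if it is strictly greater than the last one seen
   (both treated as big-endian counters), which rejects replays. */
static int nonce_is_fresh(const unsigned char *last, const unsigned char *now,
                          unsigned long len)
{
   return memcmp(now, last, len) > 0;
}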
Frequently Asked Questions

The following Frequently Asked Questions, answered by the authors of this book, are designed both to measure your understanding of the concepts presented in this chapter and to assist you with real-life implementation of these concepts.

Q: What is an Encrypt and Authenticate algorithm?
A: These are algorithms designed to accept a single key and IV, and allow the caller to both encrypt and authenticate his message. Unlike distinct encrypt and authenticate protocols, these modes do not require two independent keys for security.
Q: What are the benefits of these modes over the distinct modes?
A: There are two essential benefits to these modes. First, they are easier to incorporate into cryptosystems, as they require fewer keys and perform two functions at the same time. This means a developer can accomplish two goals with a single function call. It also means they are more likely to start authenticating traffic. The second benefit is that they are often reducible to the security of the underlying cipher or primitive.
Q: What does reducible mean?
A: When we say one problem is reducible to another, we imply that solving the former involves a solution to the latter. For instance, the security of CTR-AES reduces to the problem of whether AES is a true pseudo-random permutation (PRP). If AES is not a PRP, CTR-AES is not secure.
Q: What Encrypt and Authenticate modes are standardized?
A: CCM is specified in the NIST publication SP 800-38C. GCM is an IEEE standard and will be specified in the NIST publication SP 800-38D in the future.
Q: Should I use these modes over distinct modes such as AES-CTR and SHA1-HMAC?
A: That depends on whether you are trying to adhere to an older standard. If you are not, most likely GCM or CCM make better use of system resources to accomplish equivalent goals. The distinct modes can be just as secure in practice, but are generally harder to implement and verify in fielded applications. They also rely on more security assumptions, which poses a security risk.
Q: What is a nonce, and why is it important?
A: A nonce is a parameter (like an initial value) that is meant to be used only once. Nonces are used to introduce entropy into an otherwise static and fully deterministic process (that is, after the key has been chosen). Nonces must be unique under one key, but may be reused across different keys. Usually, they are simply packet counters. Without a unique nonce, both GCM and CCM have no provable security properties.
Q: What is additional authentication data? AAD? Header data?
A: Additional Authentication Data (AAD), also known as header data and associated data (in EAX), is data related to a message that must be part of the MAC tag computation but not encrypted. Typically, it is simple metadata related to the session that is not private but affects the interpretation of the messages. For instance, an IP datagram encoder could use parts of the IP datagram as part of the header data. That way, if the routing is changed it is detectable, and since the IP header cannot be encrypted (without using another IP standard), it must remain in plaintext form. That is only one example of AAD; there are others. In practice, it is not used much, but it is provided nonetheless.
Q: Which mode should I use? GCM or CCM?
A: That depends on the standard you are trying to comply with (if any). CCM is currently the only NIST standard of the two. If you are working with wireless standards, chances are you will need GCM. Outside of those two cases, it depends on the platform. GCM is very fast, but requires a huge 64-kilobyte table to achieve this speed (at least in software). CCM is slightly slower, but is much simpler in design and implementation. CCM also does not require huge tables to achieve its performance claims. If you removed the tables from GCM, it would be much slower than CCM.
Q: Are there any patents on these modes?
A: Neither GCM nor CCM is patented. They are free for any use.
Q: Where can I find implementations?
A: Currently, only LibTomCrypt provides both GCM and CCM in library form. Brian Gladman has independent copyrighted implementations available. Crypto++ has neither, but the author is aware of this.
Q: What do I do with the MAC tag?
A: If you are encrypting, transmit it along with your ciphertext and AAD data (if any). If you are decrypting, the decryption will generate a MAC tag; compare that with the value stored along with the ciphertext. If they do not compare as equal, the message has been altered or the stored MAC tag is invalid.
Chapter 8: Large Integer Arithmetic
Solutions in this chapter:
■ What Are BigNums?
■ Why Do We Need Them for Cryptography?
■ What Algorithms Are Most Important?
Solutions Fast Track
Frequently Asked Questions
So far, we have been examining symmetric key algorithms that rely solely on secret keys for security. Now we are going to explore the realm of public key cryptography, but before we can do this, we have a significant piece of mathematics to cover.
Most standard public key algorithms are based on problems that are hard to solve in general. For example, the RSA algorithm is (loosely speaking) as secure as factoring is hard. That is, if factoring is hard, breaking RSA is, too (in practice). Similarly, elliptic curve algorithms are as hard to break as inverting point multiplication on the given curve.
In both cases, the “problem” becomes harder as you increase the size of the problem. In the case of RSA, as you increase the composite (public key), factoring becomes harder. Similarly, as you increase the order of the elliptic curve (do not worry if you do not know what that means at this point), the difficulty of inverting point multiplication increases.
To accommodate these larger parameters, we must deploy algorithms known collectively as BigNum algorithms.
What Are BigNums?
As developers, you are most likely aware of the size limitations of your preferred data types. In C, for example, the int data type can represent only up to 32,767 in a portable fashion. Even though on certain platforms it may be 32 or even 64 bits in length, you cannot always count on that.
Most programming languages at best have a 64-bit data type at their disposal. If we had to live within these constraints directly, we would have very insecure public key algorithms. Factoring 64-bit numbers is a trivial task, for instance.
To work around these limitations, we use algorithms typically known as either multiple or fixed precision algorithms. These algorithms construct representations of larger integers using the supported data type (such as unsigned long) as a digit, much like we construct larger numbers in decimal by appending more digits in the range 0–9.
The digits (also known as limbs in various libraries) form the basis of all arithmetic. They are typically chosen to be efficient on the target platform. For example, if your machine can perform operations on unsigned long data types efficiently, that will likely be your digit type. The math we must perform works much the same as was taught in school with decimal. The only difference is that instead of using 10 as our radix, we use a power of 2, such as 2^32.
Fixed and multiple precision algorithms differ little in theory, and mostly in implementation. In multiple precision algorithms, we seek to accommodate any size integer by allocating new memory as required to represent the result of an operation. This is nice to have if we are dealing with numbers of unknown dimensions. However, it has the overhead of performing heap operations in the middle of calculations, which tend to be rather slow. Fixed precision algorithms only have a limited (fixed) space to store the representation of a number. As such, there is no need for heap operations in the middle of a calculation. Fixed precision is well suited for tasks where the dimensions of the inputs are known in advance. For example, if you are implementing ECC P-192, you know your largest number will be 384 bits (2*192) and you can adjust accordingly.
In this text, we focus on fixed precision implementation details, as they are more efficient. We dispense with much of the discussion of BigNum math and instead defer the reader to various sources.
Further Resources
For the curious reader, various other sources cover BigNum mathematics in greater depth. The book BigNum Math covers the topic by presenting the development of a public domain math library, LibTomMath. It is certainly worth a read for those new to the subject. The text The Art of Computer Programming, Volume 2 also covers the topic from a theoretical angle, presenting key mathematical concepts behind efficient algorithms.
The reader should also examine the freely available source code for the LibTomMath and TomsFastMath packages, which implement multiple and fixed (respectively) precision arithmetic. The source code for both is well documented and free to use for any purpose.
Key Algorithms
Certain key algorithms, when implemented correctly, can make a BigNum library perform very well. Even though there are dozens of algorithms involved in something like an elliptic curve point addition, there are only four algorithms of critical importance: multiplication, squaring, reduction, and modular inversion.
When performing typical public key operations, the processor time usage can usually be broken down, from most to least, into reduction, multiplication, squaring, and modular inversion, which gives you a good idea of where to spend the most time when optimizing.
In practice, we are talking about the same algorithms that are taught to young children for things like multiplication. The key difference is how they are implemented. Things like accumulator structure, loop unrolling, and opcode selection can make all the difference. We will present the four algorithms using code suitable for x86, PPC, and ARM processors, using GCC as our development tool of choice.
The Algorithms
Represent!
Before we can dive into the algorithms required for efficient public key cryptography, we must discuss the structure we will use to represent these integers. For our purposes, we will borrow from the TomsFastMath package, which uses a structure similar to the following:
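The structure itself did not survive extraction; the sketch below is reconstructed from the field descriptions that follow (dp, used, sign, FP_SIZE, fp_digit) and mirrors the fp_int type in TomsFastMath, so treat it as a reconstruction rather than the book's exact listing.

typedef struct {
   fp_digit dp[FP_SIZE];   /* the digits, least significant first          */
   int      used,          /* number of digits currently in use            */
            sign;          /* 0 for non-negative, nonzero for negative     */
} fp_int;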
where fp_digit is a type for a single digit, usually equivalent to unsigned long but subject to change to suit the platform. Along with that type is an fp_word type that we have not seen yet. It is required to be twice the size of an fp_digit and be able to hold the product of two maximum-valued fp_digit variables. The FP_SIZE macro is the default maximum size of an integer. It is based on the size of a digit and the number of bits required. For instance, if fp_digit is a 32-bit type and you wish to represent up to 384-bit integers, FP_SIZE must be at least 12. The used flag indicates how many digits of the dp array are nonzero. This allows us to manipulate shorter integers more efficiently. For example, just because your integers can be up to (say) 12 digits does not mean they all are going to be that big. Performing 12-digit operations on numbers that may only be 6 digits long wastes time and resources.
The sign flag denotes the sign of the integer. It is 0 for non-negative, and nonzero for negative. This means that our integers are always unsigned and only this flag controls the behavior.
The digits of the fp_int type are used in little-endian fashion. For example, with a 32-bit fp_digit, if the dp array contained {1, 2, 3, 4, 0, 0, ..., 0}, that would represent the integer 1 + 2*2^32 + 3*2^64 + 4*2^96.
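Purely as an illustration of that layout (not a listing from the book), the example value could be built by hand like this:

#include <string.h>

static void example_value(fp_int *x)
{
   memset(x, 0, sizeof(*x));
   x->dp[0] = 1;            /* least significant digit        */
   x->dp[1] = 2;
   x->dp[2] = 3;
   x->dp[3] = 4;            /* most significant nonzero digit */
   x->used  = 4;            /* four digits are in use         */
   x->sign  = 0;            /* non-negative                   */
   /* *x now represents 1 + 2*2^32 + 3*2^64 + 4*2^96 */
}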
Multiplication
Multiplication, much like most BigNum problems, can be solved with a variety of algorithms. Algorithms such as Karatsuba and Toom-Cook linearized multiplication claim efficient asymptotic time requirements, but are in fact not that useful in practice, at least not with typical software platforms. They do come in handy with hardware platforms, however.
It turns out the most profitable way to multiply two numbers of the size we use in public key algorithms is to use the basic O(n^2) long-hand algorithm. That is, if we are multiplying A by B, we would multiply every digit of B by every digit of A and accumulate (add up) all of the products. Assuming that A and B have the same number of digits, n, this process requires n^2 single precision multiplications (that is, multiplications of fp_digit types) (Figure 8.1).
Figure 8.1 Multiplication Algorithm
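The figure's numbered pseudocode is not reproduced here. As a stand-in, the following C sketch (a reconstruction, not the book's figure) implements the same long-hand comba loop the next paragraphs describe, using the MIN(), DIGIT_BIT, fp_digit, and fp_word names introduced elsewhere in the chapter.

#ifndef MIN
#define MIN(x, y) ((x) < (y) ? (x) : (y))
#endif

/* Long-hand (comba) multiply: c[] receives an+bn digits of the product a*b. */
void comba_mult(const fp_digit *a, int an, const fp_digit *b, int bn, fp_digit *c)
{
   fp_digit c0 = 0, c1 = 0, c2 = 0;   /* three-digit accumulator */
   int ix, tx, ty, iy, iz;

   for (ix = 0; ix < an + bn; ix++) {
      ty = MIN(ix, bn - 1);           /* highest usable digit of b         */
      tx = ix - ty;                   /* matching digit of a               */
      iy = MIN(an - tx, ty + 1);      /* number of products in this column */

      for (iz = 0; iz < iy; iz++) {
         /* {c0,c1,c2} += a[tx+iz] * b[ty-iz], using double precision */
         fp_word t = (fp_word)c0 + (fp_word)a[tx + iz] * (fp_word)b[ty - iz];
         c0  = (fp_digit)t;
         t   = (fp_word)c1 + (t >> DIGIT_BIT);
         c1  = (fp_digit)t;
         c2 += (fp_digit)(t >> DIGIT_BIT);
      }

      /* store one digit of the product, then shift the accumulator down */
      c[ix] = c0;
      c0 = c1;
      c1 = c2;
      c2 = 0;
   }
}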
This algorithm produces one digit of the product per iteration of the loop (started on step 5). We use a three-digit accumulator {c0, c1, c2} to accumulate the products inside the inner loop. This is accomplished by adding the two-digit product to the accumulator, and then propagating the carry to the third digit. It may sound complicated but, as we will see shortly, it is entirely efficient.
Outside of the inner loop, the lowest digit of the accumulator holds the digit of the product in question. We store c0 to C and then shift the accumulator down: c1 becomes c0, c2 becomes c1, and then c2 is zeroed.
The MIN() macro is used to determine the smaller of the two operands. Using it in the loop may seem like a branch hazard, and in practice, it is. However, as we will see, loop unrolling can remove this from our code. The performance of the implementation depends mostly on the ability to perform the single inner loop step efficiently.
Let's examine the generic multiply from TomsFastMath before we look at the macros that make this feasible.
Ripped from fp_mul_comba.c:
So far, this is much like our pseudo code (which was actually derived from this code).
025 for (ix = 0; ix < pa; ix++) {
026 /* get offsets into the two bignums */
034 /* this is the number of times the loop will iterrate
035 while (tx++ < a->used && ty >= 0) { }
037 iy = MIN(A->used-tx, ty+1);
At this point, our inner loop is ready to execute. We use pointer aliases tmpx and tmpy to point to the digits of A and B, respectively. This makes our inner loop simpler.
042 MULADD(*tmpx++, *tmpy );
COMBA_FORWARD performs the operation of shifting the accumulator. MULADD performs the operation of {c0, c1, c2} += *tmpx * *tmpy. At this point, we have no idea how this will be done. The point of showing this coding technique is to illustrate the power of the macros.
At this point, we have C code for multiplication, but we have no idea how the most vital portions of the code actually work. We have cleverly hidden their details in a series of C preprocessor macros. Why would we do this? To make the code harder to follow? To make it harder for others to use? The answer is diversity.
Our macro scheme is the basis of a library known as TomsFastMath. It currently targets four different architectures without rewriting significant portions of the code. Our macros are flexible enough to allow us to work with x86 in 32- and 64-bit mode (as well as with SSE2), 32- and 64-bit PPC, and ARMv4 platforms, even though the instruction sets of the x86, PPC, and ARM do not resemble one another.
First, let's examine the portable C implementation of these macros just to get a good idea of what they are supposed to accomplish.
#define COMBA_START
This macro has no use inside the logic of the multiplication other than to make it easier to support various platforms.
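The next macro's definition was not preserved in extraction; in the portable build it simply clears the three accumulator digits, along the lines of this reconstruction (an assumption, not the book's exact listing):

#define COMBA_CLEAR \
   c0 = c1 = c2 = 0;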
This macro zeros the accumulator.
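The two store macros referred to next were likewise not preserved; a reconstruction consistent with how the text describes them is:

#define COMBA_STORE(x)  \
   x = c0;

#define COMBA_STORE2(x) \
   x = c1;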
These two macros store values out of the accumulator. We have seen COMBA_STORE, but not COMBA_STORE2, as it is part of the unrolled code.
#define MULADD(i, j)                                                   \
   do { fp_word t;                                                     \
   t = (fp_word)c0 + ((fp_word)i) * ((fp_word)j); c0 = t;              \
   t = (fp_word)c1 + (t >> DIGIT_BIT); c1 = t; c2 += t >> DIGIT_BIT;   \
   } while (0);
This performs the inner loop multiplication and accumulate. We use a double precision fp_word to hold the product and then add it to the accumulator. By using double precision types, we avoid the otherwise required test for overflow. This is because the C language lacks an “add with carry” operation otherwise required to propagate carry bits.
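To see why the double-precision accumulate cannot overflow: with D-bit digits, a single product is at most (2^D - 1)^2 = 2^(2D) - 2^(D+1) + 1, and adding the D-bit value of c0 keeps the total at or below 2^(2D) - 2^D, which still fits in the 2D-bit fp_word. The carry folded into c1 is then at most 2^D - 1, and the carry into c2 is at most 1 per MULADD.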
So far, of all the platforms we support with our code, the only macro that changes is MULADD. Table 8.1 lists various platforms and their respective macros.
Table 8.1 MULADD Macros for Various Platforms (x86_32, x86_32 + SSE2, x86_64, PPC32, and ARMv4; the table's macro definitions are not reproduced here)
All five macros accomplish the same goal, but with different architectures in mind. They all multiply i by j and accumulate the product in {c0, c1, c2}.
The x86_32 and x86_64 macros use the MUL instruction from the x86 instruction set. They load the i operand into the EAX (RAX, resp.) register and then perform a multiply against the fp_digit pointed to by j. The product in the x86 instruction set is always stored in the EDX:EAX (RDX:RAX, resp.) registers. This product is then accumulated into the three-digit accumulator {c0, c1, c2}, which we ask GCC to alias to processor registers. When GCC has parsed this macro, it turns into assembler output resembling the following:
movq  (%rsp),%rax
mulq  8(%rsp)
addq  %rax,%rcx
adcq  %rdx,%rsi
adcq  $0,%rdi
For those who cannot read x86_64 assembler, GCC has assigned {c0, c1, c2} to {rcx, rsi, rdi}, three processor registers. This is partly what makes the code so efficient: the products are accumulated without additional memory accesses.
The x86_32 SSE2 code is meant for Pentium 4 Northwood (before Prescott) processors. In these processors, the FPU is used for all integer MUL instructions, which means that if you simply use the FPU directly, you can get the product faster. Intel later improved their cores and this is no longer the case. This code is not efficient on AMD processors, at least not compared to the integer multiplier AMD has.
It would seem that using SSE2 to perform two 32-bit multiplications in parallel would be a faster way to perform multiplication. This is not true, however. The AMD64 (and Opteron) series of processors can perform a single 64-bit multiplication in roughly five processor cycles. You would need four times as many 32-bit multiplications to accomplish what you could with 64-bit multiplications. Therefore, even if you are doing two at once, you have to accomplish them in less than half the time to become faster. Currently, the SSE2 multiplication is not a single cycle on AMD64 processors, nor will it be in the near future.
The PPC32 code is another straight adaptation to the instruction set. The PPC differs from other instruction sets in that only half of the product is produced per opcode. The mullw instruction produces the lower 32 bits of the product, while mulhwu produces the higher 32 bits. We could place both multiplications back to back; however, we want to avoid clobbering as many registers as possible (clobbering is the GCC term for destroying the contents of a register inside an assembler macro). This macro clobbers only one register and requires three others for the accumulator.
The PPC64 instruction set has similar opcodes; for instance, mulld and mulhdu perform the equivalent 64-bit multiplications. Those are currently not in the project due to lack of access to a PPC64 machine.
The ARMv4 code requires a v4 core with the M features (e.g., ARM7TDMI) or a v5 core or higher. It is perhaps the most elegant of all the code. We store the product in r0 and r1 and then accumulate it. The astute reader may notice we do not use the ARMv4 “multiply and accumulate” instruction. This is because it does not set the carry flag. Therefore, we could not pack 32 bits per digit if we used it.
Code Unrolling
So, now we have the general technique, and macros to plug into the code. Now we need to organize the code for real raw performance. The real boost comes from unrolling the entire inner and outer loops. If we know in advance what size numbers we are multiplying, this can pay off big.
Now, instead of unrolling a multiply by hand, or telling GCC implicitly the size of our numbers, we will craft a routine that is fully, explicitly unrolled. To do this, we will write a program that emits C source code for a multiplier. For reference, we will use another source file from the TomsFastMath project. This program accepts a single dimension N as input and produces the C source code for an N-by-N multiply.
ripped from comba_mult_gen.c:
011 /* program emits a NxN comba multiplier */
021 "void fp_mul_comba%d(fp_int *A, fp_int *B, fp_int *C)\n"
We will name the routine after how many digits it handles. For instance, fp_mul_comba16() would perform a 16-by-16 multiplication.
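To give a feel for what the generator's output looks like, here is a hedged illustration of a fully unrolled 2-by-2 comba multiplier built from the macros above. It follows the style of TomsFastMath's small unrolled multipliers but is a reconstruction, not the book's listing; fp_clamp() is assumed to be the library helper that trims leading zero digits.

#include <string.h>

void fp_mul_comba2(fp_int *A, fp_int *B, fp_int *C)
{
   fp_digit c0, c1, c2, at[4];

   memcpy(at,     A->dp, 2 * sizeof(fp_digit));
   memcpy(at + 2, B->dp, 2 * sizeof(fp_digit));
   COMBA_START;
   COMBA_CLEAR;

   /* output digit 0 */
   MULADD(at[0], at[2]);
   COMBA_STORE(C->dp[0]);

   /* output digit 1 */
   COMBA_FORWARD;
   MULADD(at[0], at[3]);
   MULADD(at[1], at[2]);
   COMBA_STORE(C->dp[1]);

   /* output digit 2, plus the final carry into digit 3 */
   COMBA_FORWARD;
   MULADD(at[1], at[3]);
   COMBA_STORE(C->dp[2]);
   COMBA_STORE2(C->dp[3]);

   C->used = 4;
   C->sign = A->sign ^ B->sign;
   fp_clamp(C);                 /* trim leading zero digits */
}

Notice that the MIN() branch from the generic loop is gone entirely: the generator knows exactly which digit pairs contribute to each output column, which is precisely the payoff of unrolling described above.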