In most modern operating systems, the memory used by a program (or process) is known as virtual memory. The memory has no fixed physical address and can be moved between locations and even swapped to disk (through page invalidation). This latter action is typically known as swap memory, as it allows users to emulate having more physical memory than they really do.
The downside to swap memory, however, is that the process memory could contain sensitive information such as private keys, usernames, passwords, and other credentials. To prevent this, an application can lock memory. In operating systems such as those based on the NT kernel (e.g., Win2K, WinXP), locking is entirely voluntary and the OS can choose to later swap nonkernel data out.
In POSIX-compatible operating systems, such as those based on the Linux and BSD kernels, a set of functions such as mlock(), munlock(), mlockall(), and so forth have been provided to facilitate locking. Physical memory in most systems can be costly, so the polite and proper application will request to lock as little memory as possible. In most cases, locked memory will span a region that contains pages of memory. On the x86 series of processors, a page is four kilobytes. This means that all locked memory will actually lock a multiple of four kilobytes.
Ideally, an application will pool its related credentials to reduce the number of physical pages required to lock them in memory.
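As a brief illustration of the calls just mentioned, the following sketch locks a single page-sized pool of credentials with mlock() and wipes it before unlocking. The buffer name and pooling strategy are illustrative assumptions, not part of the original text, and the alignment attribute is GCC-specific.

#include <string.h>
#include <sys/mman.h>

/* Hypothetical credential pool: related secrets are grouped into one
   page-aligned buffer so a single mlock() call covers them all. */
static unsigned char secret_pool[4096] __attribute__((aligned(4096)));

int protect_secrets(void)
{
   /* Ask the OS to keep this page resident so it is never swapped to disk. */
   if (mlock(secret_pool, sizeof(secret_pool)) != 0) {
      return -1;  /* e.g., insufficient privilege or RLIMIT_MEMLOCK exceeded */
   }
   return 0;
}

void release_secrets(void)
{
   /* Wipe before unlocking so the page never reaches swap with secrets in it. */
   memset(secret_pool, 0, sizeof(secret_pool));
   munlock(secret_pool, sizeof(secret_pool));
}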
The processing of the plaintext is nicely bundled up in a single function call, making its deployment rather trivial.
Putting It All Together
This chapter introduced the two standard encrypt and authenticate modes as specified by both NIST and IEEE. They are both designed to take a single key and IV (nonce) and produce a ciphertext and message authentication code tag, thereby simplifying the process for developers by reducing the number of different standards they must support and, in practice, the number of functions they have to call to accomplish the same results.
Knowing how to use these modes is a matter of properly choosing an IV, making ideal use of the additional authentication data (AAD), and checking the MAC tag they produce. Neither of these two modes will manage any of these properties for the developer, so they must be looked after carefully.
For most applications, it is highly advisable to use these modes over an ad hoc combination of encryption and authentication, if not solely for the reason of code simplicity, then also for proper adherence to cryptographic standards.
What Are These Modes For?
We saw in the previous chapter how we could accomplish both privacy and authentication of data through the combined use of a symmetric cipher and chaining mode with a MAC algorithm. Here, the goal of these modes is to combine the two. This accomplishes several key goals simultaneously. As we alluded to in the previous chapter, CCM and GCM are also meant for small packet messages, ideal for securing a stream of messages between parties. CCM and GCM can be used for offline tasks such as file encryption, but they are not meant for such tasks (especially CCM, since it needs the length of the plaintext in advance).
First, combining the modes makes development simpler: there is only one key and one IV to keep track of. The mode will handle using both for both tasks. This makes key derivation easier and quicker, as less session data must be derived. It also means there are fewer variables floating around to keep track of.
These combined modes also mean it is possible to perform both goals with a single function call. In code where we specifically must trap error codes (usually by looking at the return codes), having fewer functions to call means the code is easier to write safely. While there are other ways to trap errors, such as signals and internal masking, making thread-safe global error detection in C is rather difficult.
In addition to making the code easier to read and write, combined modes make the security easier to analyze. CCM, for instance, is a combination of CBC-MAC and CTR encryption mode. In various ways, we can reduce the security of CCM to the security of these modes. In general, with a full-length MAC tag, the security of CCM reduces to the security of the block cipher (assuming a unique nonce and random key are used).
What we mean by reduce is that we can make an argument for equivalence. For example, if the security of CTR is reducible to the security of the cipher, we are saying it is as secure as the latter. By this reasoning, if one could break the cipher, he could also break CTR mode. (Strictly speaking, the security of CTR reduces to the determination of whether the cipher is a PRP.)
So, in this context, if we say CCM reduces to the security of the cipher in terms of being a proper pseudo-random permutation (PRP), then if we can break the cipher (by showing it is not a PRP), we can likely break CCM. Similarly, GCM reduces to the security of the cipher for privacy and to universal hashing for the MAC. It is more complicated to prove that it can be secure.
Choosing a Nonce
Both CCM and GCM require a unique nonce (N used once) value to maintain their privacy and authenticity goals. In both cases, the value need not be random, but merely unique for a given key. That is, you can safely use the same nonce (only once, though) between two different keys. Once you use the nonce for a particular key, you cannot use it again.
GCM Nonces
GCM was designed to be most efficient with 12-byte nonce values. Any longer or shorter, and GHASH is used to create an IV for the mode. In this case, we can simply use the 12-byte nonce as a packet counter. Since we have to send the nonce to the other party anyway, this means we can double up on the purpose of this field. Each packet would get its own 12-byte nonce (by incrementing it), and the receiver can check for replays and out-of-order packets by checking the nonce as if it were a 96-bit number.
You can use the 12-byte number as either a big or little endian value, as GCM will not truncate the nonce.
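A minimal sketch of the packet-counter idea, assuming the 12-byte nonce is treated as a big-endian 96-bit counter (the endianness choice is the protocol designer's, as noted above):

/* Increment a 12-byte GCM nonce interpreted as a big-endian 96-bit counter. */
static void nonce_increment(unsigned char nonce[12])
{
   int i;
   for (i = 11; i >= 0; i--) {
      if (++nonce[i] != 0) {
         break;   /* stop once a byte does not wrap to zero (no carry) */
      }
   }
}

The receiver can then compare the received nonce against the last one it accepted to detect replays and out-of-order packets.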
Additional Authentication Data
Both CCM and GCM support a sort of side channel known as additional authentication data (AAD). This data is meant to be nonprivate data that should influence the MAC tag output. That is, if the plaintext and AAD are not present together and unmodified, the tag should reflect that.
The usual use for AAD is to store session metadata along with the packet. Things such as username, session ID, and transaction ID are common. You would never use a user credential, since it would not really be something you need on a per-packet basis.
Both protocols support empty AAD strings. Only GCM is optimized to handle AAD strings that are a multiple of 16 bytes long. CCM inserts a four- or six-byte header that offsets the data and makes it harder to optimize for. In general, if you are using CCM, try to have very short AAD strings, preferably less than 10 bytes, as you can cram that into a single encrypt invocation. For GCM, try to have your AAD strings be a multiple of 16 bytes, even if you have to pad with zero bytes (the implementation cannot do this padding for you, as it would change the meaning of the AAD).
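A small sketch of that zero-padding step, with the caveat (an assumption of this example, not a library rule) that both sender and receiver must agree to treat the padded buffer as the actual AAD:

#include <string.h>

/* Round an AAD buffer up to the next multiple of 16 bytes with zero padding.
   Returns the padded length, or 0 if the buffer cannot hold the padding. */
unsigned long pad_aad(unsigned char *aad, unsigned long aadlen, unsigned long maxlen)
{
   unsigned long padded = (aadlen + 15UL) & ~15UL;
   if (padded > maxlen) {
      return 0;
   }
   memset(aad + aadlen, 0, padded - aadlen);
   return padded;
}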
MAC Tag Data
The MAC tag produced by both implementations is not checked internally. The typical usage would involve transmitting the MAC tag with the ciphertext, and the recipient would compare it against the one he generated while decrypting the message.
In theory at least, you can truncate the MAC tag to short lengths such as 80 or 96 bits. However, some evidence points to the contrary with GCM, and in reality the savings are trivial. As the research evolves, it would be best to read up on the latest analysis papers of GCM and CCM to see if short tags are still in fact secure.
In practice, you can save more space if you aggregate packets over a stable channel. For example, instead of sending 128-byte packets, send 196- or 256-byte packets. You will send fewer nonces (and protocol data), and the difference can allow you to use a longer MAC tag. Clearly, this does not work in low-latency switching cases (e.g., VoIP), so it is not a bulletproof suggestion.
Example Construction
For our example, we will borrow from the example of Chapter 6, except instead of using HMAC and CTR mode, we will use CCM. The rest of the demo works the same. Again, we will be using LibTomCrypt, as it provides a nice CCM interface that is very easy to use.
We only have one key, and it is 16 bytes long.
005 /* Our Per Packet Sizes, Nonce len and MAC len */
006 #define NONCELEN 4
007 #define MACLEN 12
008 #define OVERHEAD (NONCELEN+MACLEN)
As in the previous example, we have a packet counter length (the length of our nonce), MAC tag length, and the composite overhead. Here we have a 32-bit counter and a 96-bit MAC tag.
014 /* our nice containers */
039 unsigned char tmp[2*KEYLEN];
040 unsigned long tmplen;
We use our packet counter PktCTR as a method of keeping the packets in order.
We also use it as a nonce for the CCM algorithm. The output generated will first consist of the nonce, followed by the ciphertext, and then the MAC tag.
This function call will generate the ciphertext and MAC tag for us. Libraries are nice.
120 out+inlen+NONCELEN, &taglen, CCM_ENCRYPT))
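Only the tail of that call survives in the listing, so for orientation here is a hedged sketch of what the complete encrypt-side call might look like. It follows the ccm_memory() prototype of current LibTomCrypt releases (older releases differ slightly, for example in the prescheduled-key parameter); key and KEYLEN stand in for the secret defined in the omitted part of the listing, and the whole routine is an illustration, not the book's code.

#include <tomcrypt.h>

/* Hedged sketch only: encrypt inlen bytes from in into out as
   nonce || ciphertext || tag, mirroring the book's encode_frame(). */
static int encode_frame_sketch(const unsigned char *key,
                               const unsigned char *in, unsigned long inlen,
                               unsigned char *out) /* OVERHEAD+inlen bytes, nonce already placed */
{
   unsigned long taglen = MACLEN;

   if (ccm_memory(find_cipher("aes"),          /* assumes AES was registered */
                  key, KEYLEN,                 /* shared secret key          */
                  NULL,                        /* no prescheduled key        */
                  out, NONCELEN,               /* nonce travels with packet  */
                  NULL, 0,                     /* no AAD in this sketch      */
                  (unsigned char *)in, inlen,  /* plaintext always goes here */
                  out + NONCELEN,              /* ciphertext follows nonce   */
                  out + NONCELEN + inlen,      /* MAC tag goes last          */
                  &taglen,
                  CCM_ENCRYPT) != CRYPT_OK) {
      return -1;
   }
   return 0;
}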
These are the plaintext and ciphertext buffers. The plaintext is always specified first regardless of the direction we are using. It may get a bit confusing for those expecting an “input” and “output” order of arguments.
127 int decode_frame(const unsigned char *in,
133 unsigned char tag[MACLEN];
134 unsigned long taglen;
137 if (inlen < MACLEN+NONCELEN) { return -1; }
138 inlen -= MACLEN+NONCELEN;
As before, we assume that inlen is the length of the entire packet and not just the plaintext. We first make sure it is a valid length, and then subtract the MAC tag and nonce lengths from it.
For both encryption and decryption, the ccm_memory() function is used. In decrypt mode, it is used much the same, except our out buffer goes in the plaintext argument position. Therefore, it appears before the input, which seems a bit awkward.
Note that we store the MAC tag locally, since we need to compare it next.
165 memcpy(stream->channels[1].PktCTR, in, NONCELEN);
At this point, our packet is valid, so we copy the nonce out and store it locally, thus preventing future replay attacks.
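The two receiver-side checks implied here, comparing the locally generated tag and rejecting stale nonces, might look like the following sketch. The helper names are illustrative, and the constant-time comparison is a precaution the original listing does not spell out.

#include <string.h>

/* Compare two MAC tags without an early exit, to avoid a timing leak. */
static int tag_matches(const unsigned char *a, const unsigned char *b,
                       unsigned long len)
{
   unsigned char diff = 0;
   unsigned long i;
   for (i = 0; i < len; i++) {
      diff |= a[i] ^ b[i];
   }
   return diff == 0;
}

/* Accept a nonce only if it is strictly greater than the last one seen
   (both treated as big-endian counters), which rejects replays. */
static int nonce_is_fresh(const unsigned char *last, const unsigned char *now,
                          unsigned long len)
{
   return memcmp(now, last, len) > 0;
}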
Frequently Asked Questions

The following Frequently Asked Questions, answered by the authors of this book, are designed both to measure your understanding of the concepts presented in this chapter and to assist you with real-life implementation of these concepts.

Q: What is an Encrypt and Authenticate algorithm?
A: These are algorithms designed to accept a single key and IV, and allow the caller to both encrypt and authenticate his message. Unlike distinct encrypt and authenticate protocols, these modes do not require two independent keys for security.
Q: What are the benefits of these modes over the distinct modes?
A: There are two essential benefits to these modes. First, they are easier to incorporate into cryptosystems, as they require fewer keys and perform two functions at the same time. This means a developer can accomplish two goals with a single function call. It also means they are more likely to start authenticating traffic. The second benefit is that they are often reducible to the security of the underlying cipher or primitive.
Q: What does reducible mean?
A: When we say one problem is reducible to another, we imply that solving the former involves a solution to the latter. For instance, the security of CTR-AES reduces to the problem of whether AES is a true pseudo-random permutation (PRP). If AES is not a PRP, CTR-AES is not secure.
Q: What Encrypt and Authenticate modes are standardized?
A: CCM is specified in the NIST publication SP 800-38C. GCM is an IEEE standard and will be specified in the NIST publication SP 800-38D in the future.
Q: Should I use these modes over distinct modes such as AES-CTR and SHA1-HMAC?
A: That depends on whether you are trying to adhere to an older standard. If you are not, most likely GCM or CCM make better use of system resources to accomplish equivalent goals. The distinct modes can be just as secure in practice, but are generally harder to implement and verify in fielded applications. They also rely on more security assumptions, which poses a security risk.
Q: What is a nonce, and why is it important?
A: A nonce is a parameter (like an initial value) that is meant to be used only once. Nonces are used to introduce entropy into an otherwise static and fully deterministic process (that is, after the key has been chosen). Nonces must be unique under one key, but may be reused across different keys. Usually, they are simply packet counters. Without a unique nonce, both GCM and CCM have no provable security properties.
Q: What is additional authentication data? AAD? Header data?
A: Additional Authentication Data (AAD), also known as header data and associated data (in EAX), is data related to a message that must be part of the MAC tag computation but not encrypted. Typically, it is simple metadata related to the session that is not private but affects the interpretation of the messages. For instance, an IP datagram encoder could use parts of the IP datagram as part of the header data. That way, if the routing is changed it is detectable, and since the IP header cannot be encrypted (without using another IP standard), it must remain in plaintext form. That is only one example of AAD; there are others. In practice, it is not used much, but it is provided nonetheless.
Q: Which mode should I use? GCM or CCM?
A: That depends on the standard you are trying to comply with (if any). CCM is currently the only NIST standard of the two. If you are working with wireless standards, chances are you will need GCM. Outside of those two cases, it depends on the platform. GCM is very fast, but requires a huge 64-kilobyte table to achieve this speed (at least in software). CCM is slightly slower, but is much simpler in design and implementation. CCM also does not require huge tables to achieve its performance claims. If you removed the tables from GCM, it would be much slower than CCM.
Q: Are there any patents on these modes?
A: Neither GCM nor CCM is patented. They are free for any use.
Q: Where can I find implementations?
A: Currently, only LibTomCrypt provides both GCM and CCM in library form. Brian Gladman has independent copyrighted implementations available. Crypto++ has neither, but the author is aware of this.
Q: What do I do with the MAC tag?
A: If you are encrypting, transmit it along with your ciphertext and AAD data (if any). If you are decrypting, the decryption will generate a MAC tag; compare that with the value stored along with the ciphertext. If they do not compare as equal, the message has been altered or the stored MAC tag is invalid.
Chapter 8: Large Integer Arithmetic
Solutions in this chapter:
■ What Are BigNums?
■ Why Do We Need Them for Cryptography?
■ What Algorithms Are Most Important?
Solutions Fast Track
Frequently Asked Questions
So far, we have been examining symmetric key algorithms that rely solely on secret keys for security. Now we are going to explore the realm of public key cryptography, but before we can do this, we have a significant piece of mathematics to cover.
Most standard public key algorithms are based on problems that are hard to solve in general. For example, the RSA algorithm is (loosely speaking) as secure as factoring is hard. That is, if factoring is hard, breaking RSA is, too (in practice). Similarly, elliptic curve algorithms are as hard to break as inverting point multiplication on the given curve.
In both cases, the “problem” becomes harder as you increase the size of the problem. In the case of RSA, as you increase the composite (public key), factoring becomes harder. Similarly, as you increase the order of the elliptic curve (do not worry if you do not know what that means at this point), the difficulty of inverting point multiplication increases.
To accommodate these larger parameters, we must deploy algorithms known collectively as BigNum algorithms.
What Are BigNums?
As developers, you are most likely aware of the size limitations of your preferred data types. In C, for example, the int data type can represent only up to 32,767 in a portable fashion. Even though on certain platforms it may be 32 or even 64 bits in length, you cannot always count on that.
Most programming languages at best have a 64-bit data type at their disposal. If we had to live within these constraints directly, we would have very insecure public key algorithms. Factoring 64-bit numbers is a trivial task, for instance.
To work around these limitations, we use algorithms typically known as either multiple or fixed precision algorithms. These algorithms construct representations of larger integers using the supported data type (such as unsigned long) as a digit, much like we construct larger numbers in decimal by appending more digits in the range 0–9.
The digits (also known as limbs in various libraries) form the basis of all arithmetic. They are typically chosen to be efficient on the target platform. For example, if your machine can perform operations on unsigned long data types efficiently, that will likely be your digit type. The math we must perform works much the same as was taught in school with decimal. The only difference is that instead of using 10 as our radix, we use a power of 2, such as 2^32.
Fixed and multiple precision algorithms differ little in theory, and mostly in implementation. In multiple precision algorithms, we seek to accommodate any size integer by allocating new memory as required to represent the result of an operation. This is nice to have if we are dealing with numbers of unknown dimensions. However, it has the overhead of performing heap operations in the middle of calculations, which tend to be rather slow. Fixed precision algorithms only have a limited (fixed) space to store the representation of a number. As such, there is no need for heap operations in the middle of a calculation. Fixed precision is well suited for tasks where the dimensions of the inputs are known in advance. For example, if you are implementing ECC P-192, you know your largest number will be 384 bits (2*192) and you can adjust accordingly.
In this text, we focus on fixed precision implementation details, as they are more efficient. We dispense with much of the discussion of BigNum math and instead defer the reader to various sources.
Further Resources
For the curious reader, various other sources cover BigNum mathematics in greater depth. The book BigNum Math covers the topic by presenting the development of a public domain math library, LibTomMath. It is certainly worth a read for those new to the subject. The text The Art of Computer Programming, Volume 2 also covers the topic from a theoretical angle, presenting key mathematical concepts behind efficient algorithms.
The reader should also examine the freely available source code for the LibTomMath and TomsFastMath packages, which implement multiple and fixed (respectively) precision arithmetic. The source code for both is well documented and free to use for any purpose.
Key Algorithms
Certain key algorithms, when implemented correctly, can make a BigNum library perform very well. Even though there are dozens of algorithms involved in something like an elliptic curve point addition, there are only four algorithms of critical importance: multiplication, squaring, reduction, and modular inversion.
When performing typical public key operations, the processor time usage can usually be broken down, from most to least, into reduction, multiplication, squaring, and modular inversion, which gives you a good idea of where to spend the most time when optimizing.
In practice, we are talking about the same algorithms that are taught to young children for things like multiplication. The key difference is how they are implemented. Things like accumulator structure, loop unrolling, and opcode selection can make all the difference. We will present the four algorithms using code suitable for x86, PPC, and ARM processors, using GCC as our development tool of choice.
The Algorithms
Represent!
Before we can dive into the algorithms required for efficient public key cryptography, we must discuss the structure we will use to represent these integers. For our purposes, we will borrow from the TomsFastMath package, which uses a structure similar to the following:
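The structure itself did not survive extraction; the sketch below is reconstructed from the field descriptions that follow (dp, used, sign, FP_SIZE, fp_digit) and mirrors the fp_int type in TomsFastMath, so treat it as a reconstruction rather than the book's exact listing.

typedef struct {
   fp_digit dp[FP_SIZE];   /* the digits, least significant first          */
   int      used,          /* number of digits currently in use            */
            sign;          /* 0 for non-negative, nonzero for negative     */
} fp_int;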
where fp_digit is a type for a single digit, usually equivalent to unsigned long but subject to change to suit the platform. Along with that type is an fp_word type that we have not seen yet. It is required to be twice the size of an fp_digit and be able to hold the product of two maximum-valued fp_digit variables. The FP_SIZE macro is the default maximum size of an integer. It is based on the size of a digit and the number of bits required. For instance, if fp_digit is a 32-bit type and you wish to represent up to 384-bit integers, FP_SIZE must be at least 12. The used flag indicates how many digits of the dp array are nonzero. This allows us to manipulate shorter integers more efficiently. For example, just because your integers can be up to (say) 12 digits does not mean they all are going to be that big. Performing 12-digit operations on numbers that may only be 6 digits long wastes time and resources.
The sign flag denotes the sign of the integer. It is 0 for non-negative, and nonzero for negative. This means that our integers are always unsigned and only this flag controls the behavior.
The digits of the fp_int type are used in little-endian fashion. For example, with a 32-bit fp_digit, if the dp array contained {1, 2, 3, 4, 0, 0, ..., 0}, that would represent the integer 1 + 2*2^32 + 3*2^64 + 4*2^96.
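Purely as an illustration of that layout (not a listing from the book), the example value could be built by hand like this:

#include <string.h>

static void example_value(fp_int *x)
{
   memset(x, 0, sizeof(*x));
   x->dp[0] = 1;            /* least significant digit        */
   x->dp[1] = 2;
   x->dp[2] = 3;
   x->dp[3] = 4;            /* most significant nonzero digit */
   x->used  = 4;            /* four digits are in use         */
   x->sign  = 0;            /* non-negative                   */
   /* *x now represents 1 + 2*2^32 + 3*2^64 + 4*2^96 */
}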
Multiplication
Multiplication, much like most BigNum problems, can be solved with a variety of algorithms. Algorithms such as Karatsuba and Toom-Cook linearized multiplication claim efficient asymptotic time requirements, but are in fact not that useful in practice, at least not with typical software platforms. They do come in handy with hardware platforms, however.
It turns out the most profitable way to multiply two numbers of the size we use in public key algorithms is to use the basic O(n^2) long-hand algorithm. That is, if we are multiplying A by B, we would multiply every digit of B by every digit of A and accumulate (add up) all of the products. Assuming that A and B have the same number of digits, n, this process requires n^2 single precision multiplications (that is, multiplications of fp_digit types) (Figure 8.1).
Figure 8.1 Multiplication Algorithm
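The figure's numbered pseudocode is not reproduced here. As a stand-in, the following C sketch (a reconstruction, not the book's figure) implements the same long-hand comba loop the next paragraphs describe, using the MIN(), DIGIT_BIT, fp_digit, and fp_word names introduced elsewhere in the chapter.

#ifndef MIN
#define MIN(x, y) ((x) < (y) ? (x) : (y))
#endif

/* Long-hand (comba) multiply: c[] receives an+bn digits of the product a*b. */
void comba_mult(const fp_digit *a, int an, const fp_digit *b, int bn, fp_digit *c)
{
   fp_digit c0 = 0, c1 = 0, c2 = 0;   /* three-digit accumulator */
   int ix, tx, ty, iy, iz;

   for (ix = 0; ix < an + bn; ix++) {
      ty = MIN(ix, bn - 1);           /* highest usable digit of b         */
      tx = ix - ty;                   /* matching digit of a               */
      iy = MIN(an - tx, ty + 1);      /* number of products in this column */

      for (iz = 0; iz < iy; iz++) {
         /* {c0,c1,c2} += a[tx+iz] * b[ty-iz], using double precision */
         fp_word t = (fp_word)c0 + (fp_word)a[tx + iz] * (fp_word)b[ty - iz];
         c0  = (fp_digit)t;
         t   = (fp_word)c1 + (t >> DIGIT_BIT);
         c1  = (fp_digit)t;
         c2 += (fp_digit)(t >> DIGIT_BIT);
      }

      /* store one digit of the product, then shift the accumulator down */
      c[ix] = c0;
      c0 = c1;
      c1 = c2;
      c2 = 0;
   }
}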
This algorithm produces one digit of the product per iteration of the loop (started on step 5). We use a three-digit accumulator {c0, c1, c2} to accumulate the products inside the inner loop. This is accomplished by adding the two-digit product to the accumulator, and then propagating the carry to the third digit. It may sound complicated but, as we will see shortly, it is entirely efficient.
Outside of the inner loop, the lowest digit of the accumulator holds the digit of the product in question. We store c0 to C and then shift the accumulator down: c1 becomes c0, c2 becomes c1, and then c2 is zeroed.
The MIN() macro is used to determine the smaller of the two operands. Using it in the loop may seem like a branch hazard, and in practice, it is. However, as we will see, loop unrolling can remove this from our code. The performance of the implementation depends mostly on the ability to perform the single inner loop step efficiently.
Let's examine the generic multiply from TomsFastMath before we look at the macros that make this feasible.
Ripped from fp_mul_comba.c:
So far, this is much like our pseudo code (which was actually derived from this code).
025 for (ix = 0; ix < pa; ix++) {
026 /* get offsets into the two bignums */
034 /* this is the number of times the loop will iterrate
035 while (tx++ < a->used && ty >= 0) { }
037 iy = MIN(A->used-tx, ty+1);
At this point, our inner loop is ready to execute. We use pointer aliases tmpx and tmpy to point to the digits of A and B, respectively. This makes our inner loop simpler.
042 MULADD(*tmpx++, *tmpy );
COMBA_FORWARD performs the operation of shifting the accumulator. MULADD performs the operation of {c0, c1, c2} += *tmpx * *tmpy. At this point, we have no idea how this will be done. The point of showing this coding technique is to illustrate the power of the macros.
At this point, we have C code for multiplication, but we have no idea how the most vital portions of the code actually work. We have cleverly hidden their details in a series of C preprocessor macros. Why would we do this? To make the code harder to follow? To make it harder for others to use? The answer is diversity.
Our macro scheme is the basis of a library known as TomsFastMath. It currently targets four different architectures without rewriting significant portions of the code. Our macros are flexible enough to allow us to work with x86 in 32- and 64-bit mode (as well as with SSE2), 32- and 64-bit PPC, and ARMv4 platforms, even though the instruction sets of the x86, PPC, and ARM do not resemble one another.
First, let's examine the portable C implementation of these macros just to get a good idea of what they are supposed to accomplish.
#define COMBA_START
This macro has no use inside the logic of the multiplication other than to make it easier to support various platforms.
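The next macro's definition was not preserved in extraction; in the portable build it simply clears the three accumulator digits, along the lines of this reconstruction (an assumption, not the book's exact listing):

#define COMBA_CLEAR \
   c0 = c1 = c2 = 0;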
This macro zeros the accumulator.
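The two store macros referred to next were likewise not preserved; a reconstruction consistent with how the text describes them is:

#define COMBA_STORE(x)  \
   x = c0;

#define COMBA_STORE2(x) \
   x = c1;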
These two macros store values out of the accumulator. We have seen COMBA_STORE, but not COMBA_STORE2, as it is part of the unrolled code.
#define MULADD(i, j)                                                   \
   do { fp_word t;                                                     \
   t = (fp_word)c0 + ((fp_word)i) * ((fp_word)j); c0 = t;              \
   t = (fp_word)c1 + (t >> DIGIT_BIT); c1 = t; c2 += t >> DIGIT_BIT;   \
   } while (0);
This performs the inner loop multiplication and accumulate. We use a double precision fp_word to hold the product and then add it to the accumulator. By using double precision types, we avoid the otherwise required test for overflow. This is because the C language lacks an “add with carry” operation otherwise required to propagate carry bits.
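To see why the double-precision accumulate cannot overflow: with D-bit digits, a single product is at most (2^D - 1)^2 = 2^(2D) - 2^(D+1) + 1, and adding the D-bit value of c0 keeps the total at or below 2^(2D) - 2^D, which still fits in the 2D-bit fp_word. The carry folded into c1 is then at most 2^D - 1, and the carry into c2 is at most 1 per MULADD.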
So far, of all the platforms we support with our code, the only macro that changes is MULADD. Table 8.1 lists various platforms and their respective macros.
Table 8.1 MULADD Macros for Various Platforms (x86_32, x86_32 + SSE2, x86_64, PPC32, and ARMv4; the table's macro definitions are not reproduced here)
All five macros accomplish the same goal, but with different architectures in mind. They all multiply i by j and accumulate the product in {c0, c1, c2}.
The x86_32 and x86_64 macros use the MUL instruction from the x86 instruction set. They load the i operand into the EAX (RAX, resp.) register and then perform a multiply against the fp_digit pointed to by j. The product in the x86 instruction set is always stored in the EDX:EAX (RDX:RAX, resp.) registers. This product is then accumulated into the three-digit accumulator {c0, c1, c2}, which we ask GCC to alias to processor registers. When GCC has parsed this macro, it turns into assembler output resembling the following:
movq  (%rsp),%rax
mulq  8(%rsp)
addq  %rax,%rcx
adcq  %rdx,%rsi
adcq  $0,%rdi
For those who cannot read x86_64 assembler, GCC has assigned {c0, c1, c2} to {rcx, rsi, rdi}, three processor registers. This is partly what makes the code so efficient: the products are accumulated without additional memory accesses.
The x86_32 SSE2 code is meant for Pentium 4 Northwood (before Prescott) processors. In these processors, the FPU is used for all integer MUL instructions, which means that if you simply use the FPU directly, you can get the product faster. Intel later improved their cores and this is no longer the case. This code is not efficient on AMD processors, at least not compared to the integer multiplier AMD has.
It would seem that using SSE2 to perform two 32-bit multiplications in parallel would be a faster way to perform multiplication. This is not true, however. The AMD64 (and Opteron) series of processors can perform a single 64-bit multiplication in roughly five processor cycles. You would need four times as many 32-bit multiplications to accomplish what you could with 64-bit multiplications. Therefore, even if you are doing two at once, you have to accomplish them in less than half the time to become faster. Currently, the SSE2 multiplication is not a single cycle on AMD64 processors, nor will it be in the near future.
The PPC32 code is another straight adaptation to the instruction set. The PPC differs from other instruction sets in that only half of the product is produced per opcode. The mullw instruction produces the lower 32 bits of the product, while mulhwu produces the higher 32 bits. We could place both multiplications back to back; however, we want to avoid clobbering as many registers as possible (clobbering is the GCC term for destroying the contents of a register inside an assembler macro). This macro clobbers only one register and requires three others for the accumulator.
The PPC64 instruction set has similar opcodes; for instance, mulld and mulhdu perform the equivalent 64-bit multiplications. Those are currently not in the project due to lack of access to a PPC64 machine.
The ARMv4 code requires a v4 core with the M features (e.g., ARM7TDMI) or a v5 core or higher. It is perhaps the most elegant of all the code. We store the product in r0 and r1 and then accumulate it. The astute reader may notice we do not use the ARMv4 “multiply and accumulate” instruction. This is because it does not set the carry flag. Therefore, we could not pack 32 bits per digit if we used it.
Code Unrolling
So, now we have the general technique, and macros to plug into the code. Now we need to organize the code for real raw performance. The real boost comes from unrolling the entire inner and outer loops. If we know in advance what size numbers we are multiplying, this can pay off big.
Now, instead of unrolling a multiply by hand, or telling GCC implicitly the size of our numbers, we will craft a routine that is fully, explicitly unrolled. To do this, we will write a program that emits C source code for a multiplier. For reference, we will use another source file from the TomsFastMath project. This program accepts a single dimension N as input and produces the C source code for an N-by-N multiply.
ripped from comba_mult_gen.c:
011 /* program emits a NxN comba multiplier */
021 "void fp_mul_comba%d(fp_int *A, fp_int *B, fp_int *C)\n"
We will name the routine after how many digits it handles. For instance, fp_mul_comba16() would perform a 16-by-16 multiplication.
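To give a feel for what the generator's output looks like, here is a hedged illustration of a fully unrolled 2-by-2 comba multiplier built from the macros above. It follows the style of TomsFastMath's small unrolled multipliers but is a reconstruction, not the book's listing; fp_clamp() is assumed to be the library helper that trims leading zero digits.

#include <string.h>

void fp_mul_comba2(fp_int *A, fp_int *B, fp_int *C)
{
   fp_digit c0, c1, c2, at[4];

   memcpy(at,     A->dp, 2 * sizeof(fp_digit));
   memcpy(at + 2, B->dp, 2 * sizeof(fp_digit));
   COMBA_START;
   COMBA_CLEAR;

   /* output digit 0 */
   MULADD(at[0], at[2]);
   COMBA_STORE(C->dp[0]);

   /* output digit 1 */
   COMBA_FORWARD;
   MULADD(at[0], at[3]);
   MULADD(at[1], at[2]);
   COMBA_STORE(C->dp[1]);

   /* output digit 2, plus the final carry into digit 3 */
   COMBA_FORWARD;
   MULADD(at[1], at[3]);
   COMBA_STORE(C->dp[2]);
   COMBA_STORE2(C->dp[3]);

   C->used = 4;
   C->sign = A->sign ^ B->sign;
   fp_clamp(C);                 /* trim leading zero digits */
}

Notice that the MIN() branch from the generic loop is gone entirely: the generator knows exactly which digit pairs contribute to each output column, which is precisely the payoff of unrolling described above.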