HandBooks Professional Java-C-Scrip-SQL part 132 ppsx


8 CC -CC

Now we are ready for the second, and final, sorting operation. We start by counting the number of occurrences of each character in the first, or more significant, position. There is one A in this position, along with four B's, and three C's. Starting with the lowest character value, the key that has an A in this position ends up in the first slot of the output. The B's follow, starting with the second slot. Remember that the keys are already in order by their second, or less significant, character, because of our previous sort. This is important when we consider the order of keys that have the same first character. For example, in Figure lesser.char, the asterisks mark the keys that start with B.5 They are in order by their second character, since we have just sorted on that character. Therefore, when we rearrange the keys by their first character position, those that have the same first character will be in order by their second character as well, since we always move keys from the input to the output in order of their position in the input. That means that the B records have the same order in the output that they had in the input, as you can see in Figure greater.char.

This may seem like a lot of work to sort a few strings. However, the advantage of this method when we have many keys to be sorted is that the processing for each pass is extremely simple, requiring only a few machine instructions per byte handled (less than 30 on the 80x86 family).

More significant character sort (Figure greater.char)

    Slot   Input key   Moves to slot   Output key
    1      BA          2               AB
    2      CA          6               BA
    3      BA          3               BA
    4      AB          1               BB
    5      CB          7               BC
    6      BB          4               CA
    7      BC          5               CB
    8      CC          8               CC
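
The pass described above can be sketched in a few lines of C++. This is only an illustration under stated assumptions, not the book's own sort routine: the names counting_pass and sort_two_char_keys are hypothetical, and the keys are assumed to be strings that are all at least pos + 1 characters long.

```cpp
#include <array>
#include <cstddef>
#include <string>
#include <vector>

// One stable distribution-counting pass (hypothetical helper): orders the
// keys by the character at position `pos`, leaving keys with equal
// characters in the order they had in the input.
std::vector<std::string> counting_pass(const std::vector<std::string>& in,
                                       std::size_t pos)
{
    // Count how many keys have each character value at this position.
    std::array<std::size_t, 256> count{};
    for (const std::string& key : in)
        ++count[static_cast<unsigned char>(key[pos])];

    // Turn the counts into the first output slot for each character value.
    std::array<std::size_t, 256> start{};
    std::size_t next = 0;
    for (std::size_t c = 0; c < 256; ++c) {
        start[c] = next;
        next += count[c];
    }

    // Move keys from input to output in input order.
    std::vector<std::string> out(in.size());
    for (const std::string& key : in)
        out[start[static_cast<unsigned char>(key[pos])]++] = key;
    return out;
}

// Sort two-character keys: less significant position first, then the
// more significant one.
std::vector<std::string> sort_two_char_keys(std::vector<std::string> keys)
{
    keys = counting_pass(keys, 1);
    keys = counting_pass(keys, 0);
    return keys;
}
```

Calling counting_pass with position 0 on the input column of Figure greater.char reproduces its output column, because the B keys are copied in the order in which they appear in the input.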


On the Home Stretch

Having finished sorting the keys, we have only to retrieve the records we want from the input file in sorted order and write them to the output file. In our example, this requires reading the desired records and writing them into an output file. But how do we know which records we need to read and in what order?

The sort function requires more than just the keys to be sorted. We also have to give it a list of the record numbers to which those keys correspond, so that it can rearrange that list in the same way that it rearranges the keys. Then we can use the rearranged list of record numbers to retrieve the records in the correct order, which completes the task of our program.
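
To make the role of the record-number list concrete, here is a minimal sketch. It does not reproduce the book's sort: std::stable_sort from the standard library stands in for it, and the function name sorted_record_numbers is hypothetical. The point is only that the record numbers end up in the same order as their keys.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <string>
#include <vector>

// Returns the record numbers rearranged into the order of their sorted keys.
// keys[i] is assumed to be the key of the record numbered record_numbers[i].
// std::stable_sort is a stand-in for the distribution sort in the text.
std::vector<std::size_t> sorted_record_numbers(
    const std::vector<std::string>& keys,
    const std::vector<std::size_t>& record_numbers)
{
    // Sort positions rather than the keys themselves, so that the same
    // rearrangement can be applied to the record numbers.
    std::vector<std::size_t> order(keys.size());
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::stable_sort(order.begin(), order.end(),
                     [&keys](std::size_t a, std::size_t b) {
                         return keys[a] < keys[b];
                     });

    std::vector<std::size_t> result;
    result.reserve(order.size());
    for (std::size_t i : order)
        result.push_back(record_numbers[i]);
    return result;
}
```

The returned list can then be walked from front to back, reading one record per entry, to produce the output in key order.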

The Code

Let's start our examination of the mailing list program by looking at the header file that defines its main constants and structures, which is shown in Figure mail.00a.

The main header file for the mailing list program (mail\mail.h) (Figure mail.00a)

codelist/mail.00a

Now we're ready to look at the implementation of these algorithms, starting with function main (Figure mail.00).

main function definition (from mail\mail.cpp) (Figure mail.00)

codelist/mail.00

This function begins by checking the number of arguments with which it was called, and exits with an informative message if there aren't enough. Otherwise, it constructs the output file name and opens the file for binary input. Then it calls the initialize function (Figure mail.01), which sets up the selection criteria according to input arguments 3 through 6 (minimum spent, maximum spent, earliest last-here date, and latest last-here date). Now we are ready to call process (Figure mail.02), to select the records that meet those criteria.

initialize function definition (from mail\mail.cpp) (Figure mail.01)

codelist/mail.01


process function definition (from mail\mail.cpp) (Figure mail.02)

codelist/mail.02

The first order of business in process is to set up the buffering for the list (output) and data files. It is important to note that we are using a large buffer for the list file and for the first pass through the data file, but are changing the buffer size to the size of 1 record for the second pass through the data file. What is the reason for this change?

Determining the Proper Buffer Size

On the first pass through the data file, we are going to read every record in physical order, so a large buffer is useful in reducing the number of physical disk accesses needed.

This analysis, however, does not apply to the second pass through the data file. In this case, using a bigger buffer for the data file would actually reduce performance, since reading a large amount of data at once is helpful only if you are going to use the data that you are reading.6 On the second pass, we will read the data records in order of their ZIP codes, forcing us to move to a different position in the data file for each record rather than reading them consecutively. Using a big buffer in this situation would mean that most of the data in the buffer would be irrelevant.
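
The two buffering strategies can be sketched with the standard setvbuf call. The constants kRecordSize and kBigBuffer and both function names are assumptions made for this illustration, and the sketch opens the file separately for each pass because setvbuf has to be called before any other operation on a stream.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

constexpr std::size_t kRecordSize = 16;     // hypothetical record length
constexpr std::size_t kBigBuffer  = 32768;  // hypothetical large buffer

// First pass: read every record in physical order through a large buffer.
std::size_t count_records_sequentially(const char* path)
{
    std::FILE* f = std::fopen(path, "rb");
    if (!f) return 0;
    std::vector<char> buffer(kBigBuffer);
    std::setvbuf(f, buffer.data(), _IOFBF, buffer.size());
    char record[kRecordSize];
    std::size_t n = 0;
    while (std::fread(record, kRecordSize, 1, f) == 1)
        ++n;
    std::fclose(f);  // closed before `buffer` goes out of scope
    return n;
}

// Second pass: fetch records in an arbitrary order with a buffer that
// holds exactly one record, so no unwanted data is read ahead.
std::vector<std::string> read_records_in_order(
    const char* path, const std::vector<std::size_t>& order)
{
    std::vector<std::string> records;
    std::FILE* f = std::fopen(path, "rb");
    if (!f) return records;
    char small_buffer[kRecordSize];
    std::setvbuf(f, small_buffer, _IOFBF, sizeof small_buffer);
    char record[kRecordSize];
    for (std::size_t record_number : order) {
        long offset = static_cast<long>(record_number * kRecordSize);
        if (std::fseek(f, offset, SEEK_SET) != 0) break;
        if (std::fread(record, kRecordSize, 1, f) != 1) break;
        records.emplace_back(record, kRecordSize);
    }
    std::fclose(f);
    return records;
}
```

How much the small buffer actually helps depends on the operating system's own caching, so the benefit should be measured rather than assumed.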

Preparing to Read the Key File

Continuing in process, we calculate the number of records in the data file, which determines how large our record selection bitmap should be. Then we call the macro allocate_bitmap, which is defined in bitfunc.h (Figure bitfunc.00a), to allocate storage for the bitmap.

The header file for the bitmap functions (mail\bitfunc.h) (Figure bitfunc.00a)

codelist/bitfunc.00a

Of course, each byte of a bitmap can store eight bits, so the macro divides the number of bits we need by eight and adds one byte to the result. The extra byte is to accommodate any remainder after the division by eight.
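
The size calculation can be written out as follows. This is a sketch of the arithmetic only; the actual allocate_bitmap macro is in the listing, and the names bitmap_bytes and make_bitmap are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Bytes needed for a bitmap of `bits` bits: eight bits per byte, plus one
// extra byte to cover any remainder left over by the integer division.
std::size_t bitmap_bytes(std::size_t bits)
{
    return bits / 8 + 1;
}

// Allocate a zero-filled bitmap with room for `bits` bits.
std::vector<unsigned char> make_bitmap(std::size_t bits)
{
    return std::vector<unsigned char>(bitmap_bytes(bits), 0);
}
```

When the number of bits is an exact multiple of eight, the extra byte goes unused, which is a small price for keeping the calculation simple.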

Now that we have allocated our bitmap, we can read through the data file and select the records that meet our criteria. After initializing our counts of "items read" and "found" to zero, we are ready to start reading records. Of course, we could calculate the number of times through the loop rather than continue until we run off the end of the input file, since we know how many records there are in the file. However, since we are processing records in batches, the last of which is likely to be smaller than the rest, we might as well take advantage of the fact that when we get a short count of items_read, the operating system is telling us that we have reached the end of the file.
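
The loop structure described above might look like the following sketch. DemoRecord and count_records_in_batches are hypothetical names, and the per-record processing is omitted; the sketch shows only how a short count from fread ends the loop.

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

struct DemoRecord { char bytes[8]; };  // hypothetical fixed-size record

// Reads the stream in batches of `processing_batch` records and returns
// the total number of records read.
std::size_t count_records_in_batches(std::FILE* in,
                                     std::size_t processing_batch)
{
    if (processing_batch == 0) return 0;
    std::vector<DemoRecord> batch(processing_batch);
    std::size_t total_items_read = 0;
    for (;;) {
        std::size_t items_read = std::fread(batch.data(), sizeof(DemoRecord),
                                            processing_batch, in);
        // ... process batch[0] through batch[items_read - 1] here ...
        total_items_read += items_read;
        // A short count means we have run off the end of the file
        // (or hit a read error, which this sketch does not distinguish).
        if (items_read < processing_batch)
            break;
    }
    return total_items_read;
}
```

If the file holds an exact multiple of the batch size, the final call simply returns zero items, which is also a short count and ends the loop.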

Reading the Key File

The first thing we do in the "infinite" loop is to read a set of processing_batch records (to avoid the overhead of calling the operating system to read each record). Now we are ready to process one record at a time in the inner loop. Of course, we want to know whether the record we are examining meets our selection criteria, which are whether the customer has spent at least min_spent, no more than max_spent, and has last been here between min_date and max_date (inclusive). If the record fails to meet any of these criteria, we skip the remainder of the processing for this record via "continue". However, let's suppose that a record passes these four tests.

In that case, we increment items_found. Then we want to set the bit in the found bitmap that corresponds to this record. To do this, we need to calculate the current record number, by adding the number of records read before the current processing batch (total_items_read) and the entry number in the current batch (i). Now we are ready to call setbit (Figure bitfunc.00).

setbit function definition (from mail\bitfunc.cpp) (Figure bitfunc.00)

codelist/bitfunc.00
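
Before turning to setbit itself, the selection test and record-number calculation from the inner loop can be sketched as below. The Customer and Criteria structures, their field types, and the function name are assumptions for illustration; the real record layout is in mail.h. Dates are assumed to be stored in a form that compares correctly as integers, such as YYYYMMDD.

```cpp
#include <cstddef>
#include <vector>

struct Customer { long spent; long last_here; };   // hypothetical layout
struct Criteria { long min_spent, max_spent, min_date, max_date; };

// Returns the file-wide record numbers of the records in one batch that
// pass all four tests. `total_items_read` is the number of records read
// before this batch.
std::vector<std::size_t> select_from_batch(const std::vector<Customer>& batch,
                                           std::size_t total_items_read,
                                           const Criteria& k)
{
    std::vector<std::size_t> found;
    for (std::size_t i = 0; i < batch.size(); ++i) {
        const Customer& c = batch[i];
        if (c.spent < k.min_spent) continue;
        if (c.spent > k.max_spent) continue;
        if (c.last_here < k.min_date) continue;
        if (c.last_here > k.max_date) continue;
        // Record number = records before this batch + position in batch.
        found.push_back(total_items_read + i);
    }
    return found;
}
```

In the program described in the text, each of these record numbers would be passed to setbit rather than collected in a vector.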

Setting a Bit in a Bitmap

The setbit function is quite simple. Since there are eight bits in a byte, we have to calculate which byte we need to access and which bit within that byte. Once we have calculated these two values, we can retrieve the appropriate byte from the bitmap.

In order to set the bit we are interested in, we need to create a "mask" to isolate that bit from the others in the same byte. The statement that does this, mask = 1 << bitnumber;, may seem mysterious, but all we are doing is generating a value that has a 1 in the same position as the bit we are interested in and 0 in all other positions. Therefore, after we perform a "logical or" operation of the mask and the byte from the bitmap, the resulting value, stored back into the bitmap, will have the desired bit set.
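
The byte, bit, and mask calculations can be illustrated with the following sketch. It is not the listing from bitfunc.cpp: the names set_bit and test_bit and the use of std::vector are assumptions made to keep the example self-contained.

```cpp
#include <cstddef>
#include <vector>

// Set the bit numbered `element` in the bitmap.
void set_bit(std::vector<unsigned char>& bitmap, std::size_t element)
{
    std::size_t bytenumber = element / 8;                    // which byte
    unsigned bitnumber = static_cast<unsigned>(element % 8); // which bit
    // A 1 in the position of the bit we want, 0 everywhere else.
    unsigned char mask = static_cast<unsigned char>(1u << bitnumber);
    bitmap[bytenumber] |= mask;  // "logical or", stored back into the bitmap
}

// Report whether the bit numbered `element` is set.
bool test_bit(const std::vector<unsigned char>& bitmap, std::size_t element)
{
    return (bitmap[element / 8] & (1u << (element % 8))) != 0;
}
```

For element 10, for example, the byte number is 1 and the bit number is 2, so the mask is binary 00000100 and only that bit of the second byte changes.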

This setbit function also returns a value indicating the value of the bit before we set it. Thus, if we want to know whether we have actually changed the bit from off to on, we don't have to make a call to testbit before the call to setbit; we can use the return value from setbit to determine whether the bit was set before we called setbit. This would be useful, for example, in an application where the bitmap was being used to allocate some resource, such as a printer, which cannot be used by more than one process at a time. The function would call setbit and, if that bit had already been set, would return an error indicating that the resource was not available.
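
A minimal sketch of that resource-allocation idea follows. The names set_bit_was_set and claim_resource are hypothetical, and the sketch is single-threaded: a real multi-process allocator would also need some form of locking around the test and the set, which is not shown here.

```cpp
#include <cstddef>
#include <vector>

// Sets the bit numbered `element` and returns its previous value.
bool set_bit_was_set(std::vector<unsigned char>& bitmap, std::size_t element)
{
    std::size_t bytenumber = element / 8;
    unsigned char mask = static_cast<unsigned char>(1u << (element % 8));
    bool was_set = (bitmap[bytenumber] & mask) != 0;
    bitmap[bytenumber] |= mask;
    return was_set;
}

// Tries to claim resource number `resource`. Returns true on success and
// false if the resource was already in use.
bool claim_resource(std::vector<unsigned char>& in_use, std::size_t resource)
{
    return !set_bit_was_set(in_use, resource);
}
```

The first caller to claim a given resource succeeds; any later caller sees that the bit was already set and is refused, with no separate test call needed.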

Now we have a means of keeping track of which records have been selected. However, we also need to save the ZIP code for each selected record for our sort. Unfortunately, we don't know how many records are going to be selected until we select them. This is easily dealt with in the case of the bitmap, which is so economical of storage that we can comfortably allocate a bit for every record in the file; ZIP codes, which take ten bytes apiece, pose a more difficult problem. We need a method of allocation which can provide storage for an unknown number of ZIP codes.

Allocate as You Go

Of course, we could use a simple linked list. In that approach, every time we found a record that matches our criteria, we would allocate storage for a ZIP code and a pointer to the next ZIP code. However, this consumes storage very rapidly, as additional memory is required to keep track of every allocation of storage. When very small blocks of ten bytes or so are involved, the overhead can easily exceed the amount of storage actually used for our purposes, so that allocating 250000 14-byte blocks can easily take 7.5 megabytes or more, rather than the 3.5 megabytes that we might expect.
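
The arithmetic behind those two figures can be checked directly. The 16 bytes of bookkeeping per allocation used below is an assumption, chosen because it is the value that reproduces the 7.5 megabyte figure; actual allocator overhead varies by implementation. The function name total_bytes is hypothetical.

```cpp
#include <cstddef>

// Total storage consumed by `blocks` allocations of `block_size` bytes
// each, when the allocator adds `overhead_per_block` bytes of bookkeeping
// to every allocation.
constexpr std::size_t total_bytes(std::size_t blocks,
                                  std::size_t block_size,
                                  std::size_t overhead_per_block)
{
    return blocks * (block_size + overhead_per_block);
}

// 250000 blocks of 14 bytes with no overhead: the figure we might expect.
static_assert(total_bytes(250000, 14, 0) == 3500000, "3.5 megabytes");

// The same blocks with an assumed 16 bytes of overhead apiece.
static_assert(total_bytes(250000, 14, 16) == 7500000, "7.5 megabytes");
```

Under that assumption the bookkeeping takes more room than the data itself: 16 bytes of overhead for every 14 bytes we actually use.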

To avoid this inefficiency, we can allocate larger blocks that can accommodate a number of ZIP codes each, and keep track of the addresses of each of these larger blocks so that we can retrieve individual ZIP codes later. That is the responsibility of the code inside the if statement that compares current_zip_entry with ZIP_BLOCK_ENTRIES.7 To understand how this works, let's look back at the lines in process that set current_zip_block to -1 and current_zip_entry to ZIP_BLOCK_ENTRIES. This initialization ensures that the code in the "if" will be executed for the first selected record. We start by incrementing current_zip_block (to zero, in this case) and setting current_zip_entry to zero, to start a new block. Then we allocate storage for a new block of ZIP codes (zip_block) and set up
