HandBooks Professional Java-C-Scrip-SQL part 129 ppt



The compute_hash function (from superm\suplook.cpp) (Figure suplook.06)

codelist/suplook.06

This may look mysterious, but it's actually pretty simple. After clearing the hash code we are going to calculate, it enters a loop that first shifts the old hash code one (decimal) place to the left, end around, then adds the low four bits of the next character from the key to the result. When it finishes this loop, it returns to the caller, in this case compute_cache_hash. How did I come up with this algorithm?

Making a Hash of Things

Well, as you will recall from our example of looking up a telephone number, the idea of a hash code is to make the most of variations in the input data, so that there will be a wide distribution of "starting places" for the records in the file. If all the input values produced the same hash code, we would end up with a linear search again, which would be terribly slow. In this case, our key is a UPC code, which is composed of decimal digits. If each of those digits contributes equally to the hash code, we should be able to produce a fairly even distribution of hash codes, which are the starting points for searching through the file for each record. As we noted earlier, this is one of the main drawbacks of hashing: the difficulty of coming up with a good hashing algorithm. After analyzing the nature of the data, you may have to try a few different algorithms with some test data, until you get a good distribution of hash codes. However, the effort is usually worthwhile, since you can often achieve an average of slightly over one disk access per lookup (assuming that several records fit in one physical disk record).
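To make the description of compute_hash concrete, here is a minimal sketch of a hash of that shape. It is not the listing from suplook.cpp: the 32-bit width, the HASH_DIV constant, and the use of std::string are assumptions made for illustration. The "end around" shift is approximated by multiplying by ten and feeding the leading decimal digit back in at the low end.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical divisor: dividing a 32-bit value by it yields that value's
// leading decimal "digit" (a number from 0 to 10).
const std::uint32_t HASH_DIV = UINT32_MAX / 10;

// Sketch of the hash described in the text. For each key character:
//   1. shift the old hash one decimal place to the left (multiply by 10),
//      carrying the leading digit around to the low end;
//   2. add the low four bits of the character (for '0'..'9', the digit value).
std::uint32_t compute_hash(const std::string& key)
{
    assert(!key.empty());
    std::uint32_t hash_code = 0;                       // clear the hash code first
    for (char c : key) {
        std::uint32_t carried = hash_code / HASH_DIV;  // leading decimal digit
        hash_code = hash_code * 10 + carried;          // unsigned overflow wraps, by definition
        hash_code += static_cast<unsigned char>(c) & 15u;
    }
    return hash_code;
}
```

With this sketch, a key made only of digits and short enough not to overflow hashes to its own numeric value, so compute_hash("0000023232") returns 23232. Every digit of the key therefore influences the result, which is the property the text asks for.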

Meanwhile, back at compute_cache_hash, we convert the result of compute_hash, which is an unsigned value, into an index into the cache. This is then returned to lookup_record_and_number as the starting cache index. As mentioned above, we are using an eight-way associative cache, in which each key can be stored in any of eight entries in a cache line. This means that we need to know where the line starts, which is computed by compute_starting_cache_hash (Figure suplook.07), and where it ends, which is computed by compute_ending_cache_hash (Figure suplook.08).9
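The relationship between a cache index and the boundaries of its line can be sketched as follows. This is an illustration, not the code in Figures suplook.07 and suplook.08: the CACHE_SIZE value is hypothetical, cache_index_from_hash is a made-up helper name, and the end of the line is assumed here to mean "one past the last entry", which may differ from the book's listing.

```cpp
#include <cassert>

const unsigned MAPPING_FACTOR = 8;   // eight entries per line, per the text
const unsigned CACHE_SIZE = 64;      // hypothetical number of cache entries
static_assert(CACHE_SIZE % MAPPING_FACTOR == 0,
              "the cache must hold a whole number of lines");

// Reduce a raw hash code to an index into the cache (hypothetical helper).
unsigned cache_index_from_hash(unsigned hash_code)
{
    return hash_code % CACHE_SIZE;
}

// Index of the first entry of the line that contains cache_index.
unsigned compute_starting_cache_hash(unsigned cache_index)
{
    assert(cache_index < CACHE_SIZE);
    return (cache_index / MAPPING_FACTOR) * MAPPING_FACTOR;
}

// Index one past the last entry of the same line (assumed convention).
unsigned compute_ending_cache_hash(unsigned cache_index)
{
    return compute_starting_cache_hash(cache_index) + MAPPING_FACTOR;
}
```

For example, a hash code of 12345 gives cache index 57 under these assumptions, so the line to search runs from entry 56 up to, but not including, entry 64.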

The compute_starting_cache_hash function (from superm\suplook.cpp) (Figure suplook.07)

codelist/suplook.07


The compute_ending_cache_hash function (from superm\suplook.cpp) (Figure suplook.08)

codelist/suplook.08

After determining the starting and ending positions where the key might be found in the cache, we compare the key in each entry to the key that we are looking for, and if they are equal, we have found the record in the cache. In this event, we set the value of the record_number argument to the file record number for this cache entry, and return with the status set to FOUND.
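The scan of a cache line described above might look like the sketch below. The CacheEntry layout, the NOT_IN_CACHE status name, and the half-open range [start, end) are assumptions for illustration; the actual code keeps keys and record numbers in its own structures.

```cpp
#include <cassert>
#include <string>
#include <vector>

enum Status { FOUND, NOT_IN_CACHE };   // NOT_IN_CACHE is a hypothetical name

struct CacheEntry {
    std::string key;              // key stored in this cache entry
    unsigned file_record_number;  // where that record lives in the file
};

// Compare the key in each entry of one cache line, [start, end), with the
// key being sought. On a match, report the file record number and FOUND.
Status search_cache_line(const std::vector<CacheEntry>& cache,
                         unsigned start, unsigned end,
                         const std::string& key,
                         unsigned& record_number)
{
    assert(start <= end && end <= cache.size());
    for (unsigned i = start; i < end; ++i) {
        if (cache[i].key == key) {
            record_number = cache[i].file_record_number;
            return FOUND;
        }
    }
    return NOT_IN_CACHE;   // record_number is left untouched
}
```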

Otherwise, the record isn't in the cache, so we will have to look for it in the file; if we find it, we will need a place to store it in the cache. So we pick a "random" entry in the line (cache_replace_index) by calculating the remainder after dividing the number of accesses we have made by the MAPPING_FACTOR. This will generate an entry index between 0 and the highest entry number, cycling through all the possibilities on each successive access, thus not favoring a particular entry number.

However, if the line has an invalid entry (where the key is INVALID_BCD_VALUE), we should use that one, rather than throwing out a real record that might be needed later. Therefore, we search the line for such an empty entry, and if we are successful, we set cache_replace_index to its index. Next, we calculate the place to start looking in the file, via compute_file_hash (Figure suplook.09), which is very similar to compute_cache_hash except that it uses the FILE_SIZE constant in superm.h (Figure superm.00a) to calculate the index rather than the CACHE_SIZE constant, as we want a starting index in the file rather than in the cache.
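The choice of which cache entry to overwrite (rotate by access count, but prefer an invalid entry) can be sketched as below. The empty string standing in for INVALID_BCD_VALUE, the function name, and the decision to return an absolute cache index rather than an offset within the line are all assumptions for illustration.

```cpp
#include <cassert>
#include <string>
#include <vector>

const unsigned MAPPING_FACTOR = 8;    // entries per cache line, per the text
const std::string INVALID_KEY = "";   // hypothetical stand-in for INVALID_BCD_VALUE

// Choose the entry to overwrite in the line that starts at line_start.
unsigned choose_replace_index(const std::vector<std::string>& cache_keys,
                              unsigned line_start,
                              unsigned long access_count)
{
    assert(line_start + MAPPING_FACTOR <= cache_keys.size());
    // Default: cycle through the entries as accesses accumulate, so that
    // no particular entry number is favored.
    unsigned cache_replace_index =
        line_start + static_cast<unsigned>(access_count % MAPPING_FACTOR);
    // Override: an invalid (empty) entry costs nothing to reuse.
    for (unsigned i = line_start; i < line_start + MAPPING_FACTOR; ++i) {
        if (cache_keys[i] == INVALID_KEY) {
            cache_replace_index = i;
            break;
        }
    }
    return cache_replace_index;
}
```

With a full line, the eleventh access (count 11) selects entry 3; if entry 6 is invalid, entry 6 is selected instead, whatever the count.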

The compute_file_hash function (from superm\suplook.cpp) (Figure suplook.09)

codelist/suplook.09

As we noted above, this is another of the few drawbacks of this hashing method: the size of the file must be decided in advance, rather than being adjustable as data is entered. The reason is that to find a record in the file, we must be able to calculate its approximate position in the file in the same manner as it was calculated when the record was stored. The calculation of the hash code is designed to distribute the records evenly throughout a file of known size; if we changed the size of the file, we wouldn't be able to find records previously stored. Of course, different files can have different sizes, as long as we know the size of the file we are operating on currently: the size doesn't have to be an actual constant as it is in our example, but it does have to be known in advance for each file.
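A small sketch shows why the file size has to stay fixed. It assumes the starting file index is the hash code reduced modulo the number of records in the file; the function name and the sample values are hypothetical, not taken from suplook.cpp.

```cpp
#include <cassert>

// Starting position in the file for a given hash code, assuming a simple
// modulo reduction over the number of records the file can hold.
unsigned compute_file_index(unsigned hash_code, unsigned file_size)
{
    assert(file_size > 0);
    return hash_code % file_size;
}
```

Under this assumption, a hash code of 23232 maps to position 3 in a nine-record file but to position 2 in a ten-record file. A record stored while the file had one size would therefore be sought at the wrong starting position after the size changed.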

Searching the File

Now we're ready to start looking for our record in the file at the position specified by starting_file_index. Therefore, we enter a loop that searches from this starting position toward the end of the file, looking for a record with the correct key. First we set the file pointer to the first position to be read, using position_record (Figure suplook.10), then read the record at that position.
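Positioning the file pointer for fixed-length records usually comes down to multiplying the record number by the record size. The sketch below assumes exactly that; RECORD_SIZE and the function signature are hypothetical and may not match Figure suplook.10.

```cpp
#include <cassert>
#include <cstdio>

const long RECORD_SIZE = 32;   // hypothetical fixed record length in bytes

// Move the file pointer to the first byte of the given record.
// Returns true if the seek succeeded.
bool position_record(std::FILE* file, unsigned record_number)
{
    assert(file != nullptr);
    long offset = static_cast<long>(record_number) * RECORD_SIZE;
    return std::fseek(file, offset, SEEK_SET) == 0;
}
```

After a successful call for record 3, the next read starts at byte offset 96 under the assumed record size.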

The position_record function (from superm\suplook.cpp) (Figure suplook.10)

codelist/suplook.10

If the key in that record is the one we are looking for, our search is successful. On the other hand, if the record is invalid, then the record we are looking for is not in the file; when we add records to the file, we start at the position given by starting_file_index and store our new record in the first invalid record we find.10 Therefore, no record can overflow past an invalid record, as the invalid record would have been used to store the overflow record.

In either of these cases, we are through with the search, so we break out of the loop. On the other hand, if the entry is neither invalid nor the record we are looking for, we keep looking through the file until either we have found the record we want, we discover that it isn't in the file by encountering an invalid record, or we run off the end of the file. In the last case we start over at the beginning of the file.

If we have found the record, we copy it to the cache entry we've previously selected and copy its record number into the list of record numbers in the cache so that we'll know which record we have stored in that cache position. Then we return to the calling function, write_record, with the record we have found. If we have determined that the record is not in the file, then we obviously can't read it into the cache, but we do want to keep track of the record number where we stopped, since that is the record number that will be used for the record if we write it to the file.
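The search just described can be sketched in memory, with the file modelled as a vector of keys. This is a simplification under stated assumptions rather than the book's lookup_record_and_number: the function name probe_file, the empty string as the invalid marker, and the omission of the cache update are all choices made for illustration.

```cpp
#include <cassert>
#include <string>
#include <vector>

enum Status { FOUND, NOT_IN_FILE, FILE_FULL };

const std::string INVALID_KEY = "";   // hypothetical marker for an unused record

// Search from start_index toward the end of the file, wrapping to position 0
// after the last record. On FOUND, record_number is where the key lives; on
// NOT_IN_FILE, it is the invalid record where the key would be stored.
Status probe_file(const std::vector<std::string>& file_keys,
                  unsigned start_index,
                  const std::string& key,
                  unsigned& record_number)
{
    assert(start_index < file_keys.size());
    const unsigned n = static_cast<unsigned>(file_keys.size());
    unsigned i = start_index;
    for (unsigned examined = 0; examined < n; ++examined) {
        if (file_keys[i] == key) {
            record_number = i;
            return FOUND;
        }
        if (file_keys[i] == INVALID_KEY) {
            record_number = i;
            return NOT_IN_FILE;
        }
        i = (i + 1) % n;   // ran off the end: continue at the beginning
    }
    return FILE_FULL;      // every record examined; none matching, none invalid
}
```

Bounding the loop by the number of records is what turns a completely full file into a FILE_FULL status instead of an endless search.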


To clarify this whole process, let's make a file with room for only nine records by changing FILE_SIZE to 6 in superm.h (Figure superm.00a). After adding a few records, a dump looks like Figure initcondition.

Initial condition (Figure initcondition)

Position  Key          Data
0         INVALID
1         INVALID
2         0000098765:  MINESTRONE:245
3         0000121212:  OATMEAL, 1 LB.:300
4         INVALID
5         INVALID
6         0000012345:  JELLY BEANS:150
7         INVALID
8         0000099887:  POPCORN:99

Let's add a record with the key "23232" to the file. Its hash code turns out to be 3, so we look at position 3 in the file. That position is occupied by a record with key "121212", so we can't store our new record there. The next position we examine, number 4, is invalid, so we know that the record we are planning to add is not in the file. (Note that this is the exact sequence we follow to look up a record in the file as well.) We use this position to hold our new record. The file now looks like Figure aftermilk.

After adding "milk" record (Figure aftermilk)

Position  Key          Data
0         INVALID
1         INVALID
2         0000098765:  MINESTRONE:245
3         0000121212:  OATMEAL, 1 LB.:300
4         0000023232:  MILK:128
5         INVALID
6         0000012345:  JELLY BEANS:150
7         INVALID
8         0000099887:  POPCORN:99

Looking up our newly added record follows the same algorithm. The hash code is still 3, so we examine position 3, which has the key "121212". That's not the desired record, and it's not invalid, so we continue. Position 4 does match, so we have found our record. Now let's try to find some records that aren't in the file.

If we try to find a record with key "98789", it turns out to have a hash code of 8. Since that position in the file is in use, but with a different key, we haven't found our record. However, we have encountered the end of the file. What next?

Wrapping Around at End-of-File

In order to continue looking for this record, we must start over at the beginning of the file. That is, position 0 is the next logical position after the last one in the file.

As it happens, position 0 contains an invalid record, so we know that the record we want isn't in the file.11 In any event, we are now finished with lookup_record_and_number. Therefore, we return to lookup_record_number, which returns the record number to be used to write_record (Figure suplook.01), along with a status value of FILE_FULL, FOUND, or NOT_IN_FILE (which is the status we want). FILE_FULL is an error, as we cannot add a record to a file that has reached its capacity. So is FOUND, in this situation, as we are trying to add a new record, not find one that already exists. In either of these cases, we simply return the status to the calling function, process (Figure superm.02), which gives an appropriate error message and continues execution.

However, if the status is NOT_IN_FILE, write_record continues by positioning the file to the record number returned by lookup_record_number, writing the record to the file, and returning the status NOT_IN_FILE to process, which continues execution normally.
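The way write_record reacts to the three status values can be sketched as follows. This is an in-memory illustration, not Figure suplook.01: the Record layout and the idea of passing the lookup result in as parameters are assumptions made to keep the example self-contained.

```cpp
#include <cassert>
#include <string>
#include <vector>

enum Status { FOUND, NOT_IN_FILE, FILE_FULL };

struct Record {
    std::string key;
    std::string data;
};

// Store a new record, given the status and record number that a prior
// lookup produced. Only NOT_IN_FILE leads to a write; FOUND (the key
// already exists) and FILE_FULL are handed back for the caller to report.
Status write_record(std::vector<Record>& file,
                    Status lookup_status,
                    unsigned record_number,
                    const Record& record)
{
    if (lookup_status != NOT_IN_FILE)
        return lookup_status;            // error cases: nothing is written
    assert(record_number < file.size());
    file[record_number] = record;        // "position the file" and write
    return NOT_IN_FILE;                  // normal completion for an add
}
```

Applied to the example above, a lookup result of NOT_IN_FILE at position 4 stores the "23232" record in position 4, while a result of FOUND leaves the file unchanged.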

That concludes our examination of the input mode in process. The lookup mode is very similar, except that it uses the lookup_record function (Figure suplook.03) rather than lookup_record_number, since it wants the record to be returned, not just the record number. The lookup mode, of course, also differs from the entry
