all we have to do is to increase the number of active slots by 1 (to 10) and rehash the elements in slot 1. It works just like the example above; none of the slots 2-7 is affected by the change, because the second rule "folds" all of their hash calculations into their current slots.
When we have added a total of 96 elements, the number of "active" slots and the total number of slots will both be 16, so we will be back to a situation similar to the one we were in before we added our first expansion slot. What do we do when we have to add the 97th element? We double the total number of slots again and start over as before.7 This can go on until the number of slots gets to be too big to fit in the size of integer in which we maintain it, or until we run out of memory, whichever comes first.
Of course, if we want to handle a lot of data, we will probably run out of memory before we run out of slot numbers. Since the ability to retrieve any record in any size table in constant time, on the average, would certainly be very valuable when dealing with very large databases, I considered that it was worth trying to adapt this algorithm to disk-based hash tables. All that was missing was a way to store the chains of records using a storage method appropriate for disk files.
Making a Quantum Leap
Upon further consideration, I realized that the quantum file access method could be used to store a variable-length "storage element" pointed to by each hash slot, with the rest of the algorithm implemented pretty much as it was in the article. This would make it possible to store a very large number of strings by key and get back any of them with an average of a little more than one disk access, without having to know how big the file would be when we created it. This algorithm also has the pleasant effect of making deletions fairly simple to implement, with the file storage of deleted elements automatically reclaimed as they are removed.
Contrary to my usual practice elsewhere in this book, I have not developed a sample "application" program, but have instead opted to write a test program to validate the algorithm. This was very useful to me during the development of the algorithm; after all, this implementation of dynamic hashing is supposed to be able to handle hundreds of thousands of records while maintaining rapid access, so it is very helpful to be able to demonstrate that it can indeed do that. The test program is shown in Figure hashtest.00.
The test program for the dynamic hashing algorithm (quantum\hashtest.cpp) (Figure hashtest.00)
codelist/hashtest.00
This program stores and retrieves strings by a nine-digit key, stored as a String value. To reduce the overhead of storing each entry as a separate object in the quantum file, all strings having the same hash code are combined into one "storage element" in the quantum file system; each storage element is addressed by one "hash slot". The current version of the dynamic hashing algorithm used by hashtest.cpp allocates one hash slot for every six strings in the file; since the average individual string length is about 60 characters, and there are 18 bytes of overhead for each string in a storage element (four bytes for the key length, four bytes for the data length, and 10 bytes for the key value including its null terminator), this means that the average storage element will be a bit under 500 bytes long. A larger element packing factor than six strings per element would produce a smaller hash table and would therefore be more space efficient. However, the choice of this value is not critical with the current implementation of this algorithm, because any storage elements that become too long to fit into a quantum will be broken up and stored separately by a mechanism which I will get to later.
Of course, in order to run meaningful tests, we have to do more than store records in the hash file; we also have to retrieve what has been stored, which means that we have to store the keys someplace so that we can use them again to retrieve records corresponding to those keys. In the current version of this algorithm, I use a FlexArray (i.e., a persistent array of strings, such as we examined in the previous chapter) to store the values of the keys. However, in the original version of this algorithm, I was storing the key as an unsigned long value, so I decided to use the quantum file storage to implement a persistent array of unsigned long values, and store the keys in such an array.
Persistence Pays Off
It was surprisingly easy to implement a persistent array of unsigned long values,8 for which I defined a typedef of Ulong, mostly to save typing. The header file for this data type is persist.h (Figure persist.00).
The interface for the PersistentArrayUlong class (quantum\persist.h) (Figure persist.00)
codelist/persist.00
As you can see, this class isn't very complex, and most of the implementation code is also fairly straightforward. However, we get a lot out of those relatively few lines of code; these arrays are not only persistent, but they also automatically expand to any size up to and including the maximum size of a quantum file; with the current maximum of 10000 16K blocks, a maximum-size PersistentArrayUlong could contain approximately 40,000,000 elements!
Of course, we don't store each element directly in a separate addressable entry within a main object, as this would be inappropriate because the space overhead per item would be larger than the Ulongs we want to store! Instead, we employ a two-level system similar to the one used in the dynamic hashing algorithm; the quantum file system stores "segments" of data, each one containing as many Ulongs as will fit in a quantum. To store or retrieve an element, we determine which segment the element belongs to and access the element by its offset in that segment.
However, before we can use a PersistentArrayUlong, we have to construct it, which we can do via the default constructor (Figure persist.01).
The default constructor for the PersistentArrayUlong class (from quantum\persist.cpp) (Figure persist.01)
codelist/persist.01
This constructor doesn't actually create a usable array; it is only there to allow us to declare a PersistentArrayUlong before we want to use it. When we really want to construct a usable array, we use the normal constructor shown in Figure persist.02.
The normal constructor for the PersistentArrayUlong class (from quantum\persist.cpp) (Figure persist.02)
codelist/persist.02
As you can see, to construct a real usable array, we provide a pointer to the quantum file in which it is to be stored, along with a name for the array. The object directory of that quantum file is searched for a main object with the name specified in the constructor; if it is found, the construction is complete. Otherwise, a new object is created with one element, expandable to fill up the entire file system if necessary.
To store an element in the array we have created, we can use StoreElement (Figure persist.03).
The PersistentArrayUlong::StoreElement function (from quantum\persist.cpp) (Figure persist.03)
codelist/persist.03
This routine first calculates which segment of the array contains the element we need to update, and the element number within that segment. Then, if we are running the "debugging" version (i.e., asserts are enabled), it checks whether the segment number is within the maximum range we set up when we created the array. This test should never fail unless there is something wrong with the calling routine (or its callers), so that the element number passed in is absurdly large. As discussed above, with all such conditional checks, we have to try to make sure that our testing is good enough to find any errors that might cause this to happen; with a "release" version of the program, this would be a fatal error.
Next, we check whether the segment number we need is already allocated to the array; if not, we increase the number of segments as needed by calling GrowMainObject, but don't actually initialize any new segments until they're accessed, so that "sparse" arrays won't take up as much room as ones that are filled in completely. Next, we get a copy of the segment containing the element to be updated; if it's of zero length, that means we haven't initialized it yet, so we have to allocate memory for the new segment and fill it with zeros. At this point, we are ready to create an AccessVector called TempUlongVector of type Ulong and use it as a "template" (no pun intended) to allow access to the element we want to modify. Since AccessVector has the semantics of an array, we can simply set the ElementNumberth element of TempUlongVector to the value of the input argument p_Element; the result of this is to place the new element value into the correct place in the TempVector array. Finally, we store TempVector back into the main object, replacing the old copy of the segment.
To retrieve an element from an array, we can use GetElement (Figure persist.04).
The PersistentArrayUlong::GetElement function (from quantum\persist.cpp) (Figure persist.04)
codelist/persist.04
First, we calculate the segment number and element number, and check (via qfassert) whether the segment number is within the range of allocated segments; if it isn't, we have committed the programming error of accessing an uninitialized value. Assuming this test is passed, we retrieve the segment, set up the temporary Vector TempUlongVector to allow access to the segment as an SVector of Ulongs, and return the value from the ElementNumberth element of the array.
All this is very well if we want to write things like "Y.Put(100,100000L);" or "X = Y.Get(100);", to store or retrieve the 100th element of the Y "array", respectively. But wouldn't it be much nicer to be able to write "Y[100] = 100000L;" or "X = Y[100];" instead?
In Resplendent Array
Clearly, that would be a big improvement in the syntax; as it happens, it's not hard to make such references possible, with the addition of only a few lines of code.9 Unfortunately, this code is not the most straightforward, but the syntactic improvement that it provides is worth the trouble. The key is operator[ ] (Figure persist.05).
The PersistentArrayUlong::operator[ ] function (from quantum\persist.cpp) (Figure persist.05)
codelist/persist.05
This function returns a temporary value of a type that behaves differently in the context of an "lvalue" reference (i.e., a "write") than it does when referenced as an "rvalue" (i.e., a "read"). In order to follow how this process works, let's use the example in Figure persist1.
Persistent array example (Figure persist1)
codelist/perexam.00
The first question to be answered is how the compiler decodes the following line:

Save[1000000L] = 1234567L;
According to the definition of PersistentArrayUlong::operator[ ], this operator returns a PersistentArrayUlongRef that is constructed with the two parameters *this and p_Index, where the former is the array object on which the operator was invoked (here, Save), and the latter is the value inside the [ ], which in this case is 1000000L. What is this return value? To answer this question, we have to look at the normal constructor for the PersistentArrayUlongRef class (Figure persist.06).