exactly after the sort phase is completed.) The best choice between these two alternatives, the lowest reasonable value of P and the highest reasonable value of P, is obviously very dependent on many systems parameters: both alternatives (and some in between) should be considered.
Polyphase Merging
One problem with balanced multiway merging for tape sorting is that it requires either an excessive number of tape units or excessive copying. For P-way merging, either we must use 2P tapes (P for input and P for output) or we must copy almost all of the file from a single output tape to P input tapes between merging passes, which effectively doubles the number of passes to be about $2\log_P(N/2M)$. Several clever tape-sorting algorithms have been invented which eliminate virtually all of this copying by changing the way in which the small sorted blocks are merged together. The most prominent of these methods is called polyphase merging.
The basic idea behind polyphase merging is to distribute the sorted blocks produced by replacement selection somewhat unevenly among the available tape units (leaving one empty) and then to apply a “merge until empty” strategy: merge onto the empty tape until one of the input tapes runs out, at which point the emptied input tape and the output tape switch roles.
For example, suppose that we have just three tapes, and we start out with the following initial configuration of sorted blocks on the tapes. (This comes from applying replacement selection to our example file with an internal memory that can only hold two records.)
Tape 1: A O R S T   I N   A G N   D E M R   G I N
Tape 2: E G X   A M P   E L
Tape 3:
After three 2-way merges from tapes 1 and 2 to tape 3, the second tape becomes empty and we are left with the configuration:
Tape 1: D E M R   G I N
Tape 2:
Tape 3: A E G O R S T X   A I M N P   A E G L N
Then, after two 2-way merges from tapes 1 and 3 to tape 2, the first tape becomes empty, leaving:
Tape 1:
Tape 2: A D E E G M O R R S T X   A G I I M N N P
Tape 3: A E G L N
The sort is completed in two more steps. First, a two-way merge from tapes 2 and 3 to tape 1 leaves one file on tape 2 and one file on tape 1. Then a two-way merge from tapes 1 and 2 to tape 3 leaves the entire sorted file on tape 3.
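The three-tape example above can be replayed in a few lines. The following is only a sketch under stated assumptions: the "tapes" are in-memory Python lists of sorted runs rather than real tape units, the name `polyphase_merge` is made up for illustration, and the initial distribution is assumed to be one for which "merge until empty" leaves exactly one run at the end.

```python
import heapq

def polyphase_merge(tapes):
    """Merge runs on in-memory "tapes" (lists of sorted lists) down to one run.

    Assumes at least one tape starts out empty and that the run counts form
    a distribution that polyphase merging can finish (as in the text).
    """
    tapes = [list(t) for t in tapes]              # work on a copy
    while sum(len(t) for t in tapes) > 1:
        out = next(i for i, t in enumerate(tapes) if not t)   # an empty tape
        inputs = [i for i in range(len(tapes)) if i != out and tapes[i]]
        if len(inputs) < 2:
            raise ValueError("run counts do not allow a polyphase merge")
        # "Merge until empty": stop when the shortest input tape runs out.
        for _ in range(min(len(tapes[i]) for i in inputs)):
            runs = [tapes[i].pop(0) for i in inputs]
            tapes[out].append(list(heapq.merge(*runs)))
    return tapes

# Initial configuration from the text (each run is a list of letters).
tape1 = [list(r) for r in "AORST IN AGN DEMR GIN".split()]
tape2 = [list(r) for r in "EGX AMP EL".split()]
final = polyphase_merge([tape1, tape2, []])
print("".join(final[2][0]))   # the single sorted run, left on tape 3
```

Each pass through the `while` loop corresponds to one phase: three merges, then two, then one, then one, exactly as traced above.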
This “merge until empty” strategy can be extended to work for an arbitrary number of tapes. For example, if we have four tape units T1, T2, T3, and T4 and we start out with T1 being the output tape, T2 having 13 initial runs, T3 having 11 initial runs, and T4 having 7 initial runs, then after running a 3-way “merge until empty,” we have T4 empty, T1 with 7 (long) runs, T2 with 6 runs, and T3 with 4 runs. At this point, we can rewind T1 and make it an input tape, and rewind T4 and make it an output tape. Continuing in this way, we eventually get the whole sorted file onto T1:
T1  T2  T3  T4
 7   6   4   0
 3   2   0   4
 1   0   2   2
 0   1   1   1
 1   0   0   0
The merge is broken up into many phases which don’t involve all the data, but no direct copying is involved.
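The run counts in the table can be reproduced by tracking only how many runs sit on each tape. This is a sketch, not a tape-sorting routine: the function name `polyphase_phases` is hypothetical, and it assumes the starting counts form a distribution that leaves exactly one empty tape to write to at every phase.

```python
def polyphase_phases(counts):
    """Return the run counts per tape before and after each phase."""
    counts = list(counts)
    table = [tuple(counts)]
    while sum(counts) > 1:
        out = counts.index(0)                     # the empty (output) tape
        inputs = [i for i, c in enumerate(counts) if c > 0]
        if len(inputs) < 2:
            raise ValueError("run counts do not allow a polyphase merge")
        merges = min(counts[i] for i in inputs)   # merge until one input is empty
        for i in inputs:
            counts[i] -= merges                   # each merge consumes one run per input
        counts[out] += merges                     # and writes one longer run
        table.append(tuple(counts))
    return table

for row in polyphase_phases([0, 13, 11, 7]):
    print(*row)
```

Starting from 0, 13, 11, 7, the rows after the first are the lines of the table above.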
The main difficulty in implementing a polyphase merge is to determine how to distribute the initial runs. It is not difficult to see how to build the table above by working backwards: take the largest number on each line, make it zero, and add it to each of the other numbers to get the previous line. This corresponds to defining the highest-order merge for the previous line which could give the present line. This technique works for any number of tapes (at least three): the numbers which arise are “generalized Fibonacci numbers” which have many interesting properties. Of course, the number of initial runs may not be known in advance, and it probably won’t be exactly a generalized Fibonacci number. Thus a number of “dummy” runs must be added to make the number of initial runs exactly what is needed for the table.
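The working-backwards rule is short enough to write down directly. The sketch below uses hypothetical names (`step_back`, `perfect_distributions`, `smallest_perfect`) and breaks ties for "the largest number" by taking the first tape, which is an arbitrary choice. The dummy-run count it reports is simply the shortfall from the smallest total that fits, which is the naive approach rather than an optimized placement of dummy runs.

```python
def step_back(counts):
    """Undo one phase: zero the largest count and add it to all the others."""
    big = counts.index(max(counts))          # first tape holding the largest count
    amount = counts[big]
    return [0 if i == big else c + amount for i, c in enumerate(counts)]

def perfect_distributions(tapes, levels):
    """Rows of run counts, from the final state back to `levels` phases earlier."""
    if tapes < 3:
        raise ValueError("polyphase merging needs at least three tapes")
    counts = [1] + [0] * (tapes - 1)         # final state: one run on the first tape
    rows = [tuple(counts)]
    for _ in range(levels):
        counts = step_back(counts)
        rows.append(tuple(counts))
    return rows

def smallest_perfect(tapes, initial_runs):
    """Smallest distribution holding `initial_runs`, plus the dummy runs needed."""
    if tapes < 3:
        raise ValueError("polyphase merging needs at least three tapes")
    counts = [1] + [0] * (tapes - 1)
    while sum(counts) < initial_runs:
        counts = step_back(counts)
    return tuple(counts), sum(counts) - initial_runs

print(perfect_distributions(4, 5))   # ends with (0, 13, 11, 7)
print(smallest_perfect(4, 25))       # 25 real runs need 6 dummies to reach 31
```

The rows come out in reverse order relative to the table (final state first). For three tapes the totals are the ordinary Fibonacci numbers.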
The analysis of polyphase merging is complicated, interesting, and yields surprising results. For example, it turns out that the very best method for distributing dummy runs among the tapes involves using extra phases and more dummy runs than would seem to be needed. The reason for this is that some runs are used in merges much more often than others.
There are many other factors to be taken into consideration in implementing a most efficient tape-sorting method. For example, a major factor which we have not considered at all is the time that it takes to rewind a tape. This subject has been studied extensively, and many fascinating methods have been defined. However, as mentioned above, the savings achievable over the simple multiway balanced merge are quite limited. Even polyphase merging is only better than balanced merging for small P, and then not substantially. For P > 8, balanced merging is likely to run faster than polyphase, and for smaller P the effect of polyphase is basically to save two tapes (a balanced merge with two extra tapes will run faster).
An Easier Way
Many modern computer systems provide a large virtual memory capability which should not be overlooked in implementing a method for sorting very large files. In a good virtual memory system, the programmer has the ability to address a very large amount of data, leaving to the system the responsibility of making sure that addressed data is transferred from external to internal storage when needed. This strategy relies on the fact that many programs have a relatively small “locality of reference”: each reference to memory is likely to be to an area of memory that is relatively close to other recently referenced areas. This implies that transfers from external to internal storage are needed infrequently. An internal sorting method with a small locality of reference can work very well on a virtual memory system. (For example, Quicksort has two “localities”: most references are near one of the two partitioning pointers.) But check with your systems programmer before trying it on a very large file: a method such as radix sorting, which has no locality of reference whatsoever, would be disastrous on a virtual memory system, and even Quicksort could cause problems, depending on how well the available virtual memory system is implemented. On the other hand, the strategy of using a simple internal sorting method for sorting disk files deserves serious consideration in a good virtual memory environment.
Exercises
1. Describe how you would do external selection: find the kth largest in a file of N elements, where N is much too large for the file to fit in main memory.
2. Implement the replacement selection algorithm, then use it to test the claim that the runs produced are about twice the internal memory size.
3. What is the worst that can happen when replacement selection is used to produce initial runs in a file of N records, using a priority queue of size M, with M < N?
4. How would you sort the contents of a disk if no other storage (except main memory) were available for use?
5. How would you sort the contents of a disk if only one tape (and main memory) were available for use?
6. Compare the 4-tape and 6-tape multiway balanced merge to polyphase merge with the same number of tapes, for 31 initial runs.
7. How many phases does 5-tape polyphase merge use when started up with four tapes containing 26, 15, 22, 28 runs?
8. Suppose the 31 initial runs in a 4-tape polyphase merge are each one record long (distributed 0, 13, 11, 7 initially). How many records are there in each of the files involved in the last three-way merge?
9. How should small files be handled in a Quicksort implementation to be run on a very large file within a virtual memory environment?
10. How would you organize an external priority queue? (Specifically, design a way to support the insert and remove operations of Chapter 11, when the number of elements in the priority queue could grow to be much too large for the queue to fit in main memory.)
SOURCES for Sorting
The primary reference for this section is Volume 3 of D. E. Knuth’s series, on sorting and searching. Further information on virtually every topic that we’ve touched upon can be found in that book. In particular, the results that we’ve quoted on performance characteristics of the various algorithms are backed up by complete mathematical analyses in Knuth’s book.
There is a vast amount of literature on sorting. Knuth and Rivest’s 1973 bibliography contains hundreds of entries, and this doesn’t include the treatment of sorting in countless books and articles on other subjects (not to mention work since 1973).
For Quicksort, the best reference is Hoare’s original 1962 paper, which suggests all the important variants, including the use for selection discussed in Chapter 12. Many more details on the mathematical analysis and the practical effects of many of the modifications and embellishments which have been suggested over the years may be found in this author’s 1975 Ph.D. thesis.
A good example of an advanced priority queue structure, as mentioned in Chapter 11, is J. Vuillemin’s “binomial queues” as implemented and analyzed by M. R. Brown. This data structure supports all of the priority queue operations in an elegant and efficient manner.
To get an impression of the myriad details of reducing algorithms like those we have discussed to general-purpose practical implementations, a reader would be advised to study the reference material for his particular computer system’s sort utility. Such material necessarily deals primarily with formats of keys, records and files as well as many other details, and it is often interesting to identify how the algorithms themselves are brought into play.
M. R. Brown, “Implementation and analysis of binomial queue algorithms,” SIAM Journal of Computing, 7, 3 (August, 1978).
C. A. R. Hoare, “Quicksort,” Computer Journal, 5, 1 (1962).
D. E. Knuth, The Art of Computer Programming. Volume 3: Sorting and Searching, Addison-Wesley, Reading, MA, second printing, 1975.
R. L. Rivest and D. E. Knuth, “Bibliography 26: Computing Sorting,” Computing Reviews, 13, 6 (June, 1972).
R. Sedgewick, Quicksort, Garland, New York, 1978. (Also appeared as the author’s Ph.D. dissertation, Stanford University, 1975.)
14. Elementary Searching Methods
A fundamental operation intrinsic to a great many computational tasks is searching: retrieving some particular information from a large amount of previously stored information. Normally we think of the information as divided up into records, each record having a key for use in searching. The goal of the search is to find all records with keys matching a given search key. The purpose of the search is usually to access information within the record (not merely the key) for processing.
Two common terms often used to describe data structures for searching are dictionaries and symbol tables. For example, in an English language dictionary, the “keys” are the words and the “records” the entries associated with the words which contain the definition, pronunciation, and other associated information. (One can prepare for learning and appreciating searching methods by thinking about how one would implement a system allowing access to an English language dictionary.) A symbol table is the dictionary for a program: the “keys” are the symbolic names used in the program, and the “records” contain information describing the object named.
In searching (as in sorting) we have programs which are in widespread use on a very frequent basis, so that it will be worthwhile to study a variety of methods in some detail. As with sorting, we’ll begin by looking at some elementary methods which are very useful for small tables and in other special situations and illustrate fundamental techniques exploited by more advanced methods. We’ll look at methods which store records in arrays which are either searched with key comparisons or indexed by key value, and we’ll look at a fundamental method which builds structures defined by the key values.
As with priority queues, it is best to think of search algorithms as belonging to packages implementing a variety of generic operations which can be separated from particular implementations, so that alternate implementations could be substituted easily. The operations of interest include:
Initialize the data structure.
Search for a record (or records) having a given key.
Insert a new record.
Delete a specified record.
Join two dictionaries to make a large one.
Sort the dictionary; output all the records in sorted order.
As with priority queues, it is sometimes convenient to combine some of these operations For example, a search and insert operation is often included for
efficiency in situations where records with duplicate keys are not to be kept within the data structure In many methods, once it has been determined that a key does not appear in the data structure, then the internal state of the search procedure contains precisely the information needed to insert a new record with the given key
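The remark about the internal state of a search can be made concrete with a small sketch. The sorted-array organization below is chosen only for illustration and is not a method this chapter has introduced; the class and method names are hypothetical. It relies on Python's standard `bisect.bisect_left`, which returns the position of a key if it is present and otherwise the position where it would have to be inserted.

```python
import bisect

class SortedArrayDictionary:
    """Keys kept in sorted order, with records in a parallel list."""

    def __init__(self):                      # initialize
        self.keys = []
        self.records = []

    def search(self, key):
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            return self.records[i]
        return None                          # unsuccessful search

    def search_insert(self, key, record):
        """Return the record already stored under key, else insert and return record."""
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            return self.records[i]           # duplicate key: keep the existing record
        # The failed search stopped at exactly the slot where the key belongs.
        self.keys.insert(i, key)
        self.records.insert(i, record)
        return record

d = SortedArrayDictionary()
d.search_insert("b", 1)
d.search_insert("a", 2)
d.search_insert("b", 99)    # "b" is already present, so nothing is inserted
print(d.keys, d.records)
```

Combining the two operations avoids searching twice: a separate search followed by an insert would have to locate the same position a second time.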
Records with duplicate keys can be handled in one of several ways, depending on the application First, we could insist that the primary searching data structure contain only records with distinct keys Then each “record” in this data structure might contain, for example, a link to a list of all records having that key This is the most convenient arrangement from the point
of view of the design of searching algorithms, and it is convenient in some applications since all records with a given search key are returned with one
search The second possibility is to leave records with equal keys in the
primary searching data structure and return any record with the given key for a search This is simpler for applications that process one record at a time, where the order in which records with duplicate keys are processed is not important It is inconvenient from the algorithm design point of view because some mechanism for retrieving all records with a given key must still
be provided A third possibility is to assume that each record has a unique identifier (apart from the key), and require that a search find the record with
a given identifier, given the key Or, some more complicated mechanism could
be used to distinguish among records with equal keys
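The first arrangement, in which the primary structure holds distinct keys and each entry leads to a list of all records with that key, might look like the following sketch. Python's built-in dictionary stands in for the primary searching data structure here, and the class name `MultiRecordIndex` is an assumption made for illustration.

```python
from collections import defaultdict

class MultiRecordIndex:
    """Distinct keys in the primary structure; each key maps to a list of records."""

    def __init__(self):
        self._by_key = defaultdict(list)

    def insert(self, key, record):
        self._by_key[key].append(record)     # duplicates accumulate in the list

    def search(self, key):
        # One search returns every record with the given key (possibly none).
        return list(self._by_key.get(key, []))

index = MultiRecordIndex()
index.insert("smith", "record 1")
index.insert("jones", "record 2")
index.insert("smith", "record 3")
print(index.search("smith"))
```

Under the second arrangement, `search` would instead return a single arbitrary record, and a separate mechanism would be needed to reach the rest.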
Each of the fundamental operations listed above has important applications, and quite a large number of basic organizations have been suggested to support efficient use of various combinations of the operations. In this and the next few chapters, we’ll concentrate on implementations of the fundamental functions search and insert (and, of course, initialize), with some comment on delete and sort when appropriate. As with priority queues, the join operation normally requires advanced techniques which we won’t be able to consider here.
Sequential Searching
The simplest method for searching is simply to store the records in an array,