• To complete this proof, we choose c large enough that cn/4 − c/2 − an ≥ 0, which holds once c > 4a and n ≥ 2c/(c − 4a).
• Therefore, we can determine any order statistic in linear time on average.
Selection in worst-case linear time
We can find the ith smallest element in O(n) time in the worst case. We'll describe a procedure SELECT that does so.

SELECT recursively partitions the input array.

• Will use the deterministic procedure PARTITION, but with a small modification. Instead of assuming that the last element of the subarray is the pivot, the modified PARTITION procedure is told which element to use as the pivot.

SELECT works on an array of n > 1 elements. It executes the following steps:

1. Divide the n elements into groups of 5. Get ⌈n/5⌉ groups: ⌊n/5⌋ groups with exactly 5 elements and, if 5 does not divide n, one group with the remaining n mod 5 elements.
2. Find the median of each of the ⌈n/5⌉ groups:
• Run insertion sort on each group. Takes O(1) time per group since each group has ≤ 5 elements.
• Then just pick the median from each group, in O(1) time.
3. Find the median x of the ⌈n/5⌉ medians by a recursive call to SELECT. (If ⌈n/5⌉ is even, then follow our convention and find the lower median.)
4. Using the modified version of PARTITION that takes the pivot element as input, partition the input array around x. Let x be the kth element of the array after partitioning, so that there are k − 1 elements on the low side of the partition and n − k elements on the high side.
5. Now there are three possibilities:
• If i = k, just return x.
• If i < k, return the ith smallest element on the low side of the partition by making a recursive call to SELECT.
• If i > k, return the (i − k)th smallest element on the high side of the partition by making a recursive call to SELECT.
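As an illustration (not part of the original notes), here is a minimal Python sketch of SELECT; it recurses on fresh lists rather than partitioning in place, uses a small base case instead of n < 140, and assumes distinct elements:

def select(A, i):
    """Return the i-th smallest element of A (1-indexed) in worst-case O(n)."""
    n = len(A)
    if n <= 5:
        return sorted(A)[i - 1]                      # small base case
    # Steps 1-2: groups of 5; median of each group by sorting the group.
    medians = [sorted(A[j:j + 5])[(min(5, n - j) - 1) // 2]
               for j in range(0, n, 5)]
    # Step 3: median x of the medians, by a recursive call (lower median).
    x = select(medians, (len(medians) + 1) // 2)
    # Step 4: partition around x (assumes distinct elements).
    low = [a for a in A if a < x]
    high = [a for a in A if a > x]
    k = len(low) + 1                                 # rank of x
    # Step 5: three possibilities.
    if i == k:
        return x
    elif i < k:
        return select(low, i)
    else:
        return select(high, i - k)

For example, select([9, 1, 7, 3, 5], 2) returns 3.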
Start by getting a lower bound on the number of elements that are greater than the partitioning element x:
[Each group is a column. Each white circle is the median of a group, as found in step 2. Arrows go from larger elements to smaller elements, based on what we know after step 4. Elements in the region on the lower right are known to be greater than x.]
• At least half of the medians found in step 2 are ≥ x.
• Look at the groups containing these medians that are ≥ x. All of them contribute 3 elements that are > x (the median of the group and the 2 elements in the group greater than the group's median), except for 2 of the groups: the group containing x (which has only 2 elements > x) and the group with < 5 elements.
• Forget about these 2 groups. That leaves at least ⌈(1/2)⌈n/5⌉⌉ − 2 groups, each contributing 3 elements known to be > x, so the number of elements greater than x is at least

3 (⌈(1/2) ⌈n/5⌉⌉ − 2) ≥ 3n/10 − 6 .

Symmetrically, the number of elements that are < x is at least 3n/10 − 6.

Therefore, when we call SELECT recursively in step 5, it's on ≤ 7n/10 + 6 elements.
Develop a recurrence for the worst-case running time of SELECT:
• Steps 1, 2, and 4 each take O(n) time:
• Step 1: making groups of 5 elements takes O(n) time.
• Step 2: sorting ⌈n/5⌉ groups in O(1) time each.
• Step 4: partitioning the n-element array around x takes O(n) time.
• Step 3 takes time T(⌈n/5⌉).
• Step 5 takes time ≤ T(7n/10 + 6), assuming that T(n) is monotonically increasing.
• Assume that T(n) = O(1) for small enough n. We'll use n < 140 as "small enough." Why 140? We'll see why later.
• Thus, we get the recurrence

T(n) ≤ O(1)                               if n < 140 ,
T(n) ≤ T(⌈n/5⌉) + T(7n/10 + 6) + O(n)     if n ≥ 140 .
Solve this recurrence by substitution:
• Inductive hypothesis: T(n) ≤ cn for some constant c and all n > 0.
• Assume that c is large enough that T(n) ≤ cn for all n < 140. So we are concerned only with the case n ≥ 140.
• Pick a constant a such that the function described by the O(n) term in the recurrence is ≤ an for all n > 0.
• Substitute the inductive hypothesis in the right-hand side of the recurrence:

T(n) ≤ c⌈n/5⌉ + c(7n/10 + 6) + an
     ≤ cn/5 + c + 7cn/10 + 6c + an
     = 9cn/10 + 7c + an
     = cn + (−cn/10 + 7c + an) ,

which is at most cn if −cn/10 + 7c + an ≤ 0, i.e., if c ≥ 10a(n/(n − 70)) (when n > 70).
• Because we assumed that n ≥ 140, we have n/(n − 70) ≤ 2.
• Thus, 20a ≥ 10a(n/(n − 70)), so choosing c ≥ 20a gives c ≥ 10a(n/(n − 70)), which in turn gives us the condition we need to show that T(n) ≤ cn.
• We conclude that T(n) = O(n), so that SELECT runs in linear time in all cases.
• Why 140? We could have used any integer strictly greater than 70.
• Observe that for n > 70, the fraction n/(n − 70) decreases as n increases.
• We picked n ≥ 140 so that the fraction would be ≤ 2, which is an easy constant to work with.
• We could have picked, say, n ≥ 71, so that for all n ≥ 71, the fraction would be ≤ 71/(71 − 70) = 71. Then we would have needed to choose c ≥ 710a.
Notice that SELECT and RANDOMIZED-SELECT determine information about the relative order of elements only by comparing elements.
• Sorting requires Ω(n lg n) time in the comparison model.
• Sorting algorithms that run in linear time need to make assumptions about their input.
• Linear-time selection algorithms do not require any assumptions about their input.
• Linear-time selection algorithms solve the selection problem without sorting and therefore are not subject to the Ω(n lg n) lower bound.
Solutions for Chapter 9: Medians and Order Statistics
Solution to Exercise 9.1-1
The smallest of n numbers can be found with n − 1 comparisons by conducting a tournament as follows: Compare all the numbers in pairs. Only the smaller of each pair could possibly be the smallest of all n, so the problem has been reduced to that of finding the smallest of ⌈n/2⌉ numbers. Compare those numbers in pairs, and so on, until there's just one number left, which is the answer.

To see that this algorithm does exactly n − 1 comparisons, notice that each number except the smallest loses exactly once. To show this more formally, draw a binary tree of the comparisons the algorithm does. The n numbers are the leaves, and each number that came out smaller in a comparison is the parent of the two numbers that were compared. Each non-leaf node of the tree represents a comparison, and there are n − 1 internal nodes in an n-leaf full binary tree (see Exercise B.5-3), so exactly n − 1 comparisons are made.
In the search for the smallest number, the second smallest number must have come out smallest in every comparison made with it until it was eventually compared with the smallest. So the second smallest is among the elements that were compared with the smallest during the tournament. To find it, conduct another tournament (as above) to find the smallest of these numbers. At most ⌈lg n⌉ (the height of the tree of comparisons) elements were compared with the smallest, so finding the smallest of these takes ⌈lg n⌉ − 1 comparisons in the worst case.

The total number of comparisons made in the two tournaments was

n − 1 + ⌈lg n⌉ − 1 = n + ⌈lg n⌉ − 2

in the worst case.
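A runnable Python rendering of the two tournaments (an illustration added here; smallest_two is a name invented for this sketch, and it assumes n ≥ 2):

def smallest_two(A):
    """Tournament: the smallest of n numbers with n - 1 comparisons, then the
    second smallest among the <= ceil(lg n) numbers the winner beat."""
    players = [(a, []) for a in A]       # (value, values it has beaten)
    while len(players) > 1:
        nxt = []
        for j in range(0, len(players) - 1, 2):
            (u, ub), (v, vb) = players[j], players[j + 1]
            if u <= v:
                ub.append(v)             # u wins this comparison
                nxt.append((u, ub))
            else:
                vb.append(u)             # v wins this comparison
                nxt.append((v, vb))
        if len(players) % 2 == 1:        # odd player out advances for free
            nxt.append(players[-1])
        players = nxt
    winner, beaten = players[0]
    return winner, min(beaten)           # second tournament: smallest of beaten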
Solution to Exercise 9.3-1
For groups of 7, the algorithm still works in linear time. The number of elements greater than x (and similarly, the number less than x) is at least

4 (⌈(1/2) ⌈n/7⌉⌉ − 2) ≥ 2n/7 − 8 ,

and the recurrence becomes

T(n) ≤ T(⌈n/7⌉) + T(5n/7 + 8) + O(n) ,

which can be shown to be O(n) by substitution, as for the groups of 5 case in the text.
For groups of 3, however, the algorithm no longer works in linear time. The number of elements greater than x, and the number of elements less than x, is at least

2 (⌈(1/2) ⌈n/3⌉⌉ − 2) ≥ n/3 − 4 ,

and thus the recurrence becomes

T(n) ≤ T(⌈n/3⌉) + T(2n/3 + 4) + O(n) ,

which does not have a linear solution.
We can prove that the worst-case time for groups of 3 is Ω(n lg n). We do so by deriving a recurrence for a particular case that takes Ω(n lg n) time.

In counting up the number of elements greater than x (and similarly, the number less than x), consider the particular case in which there are exactly ⌈(1/2)⌈n/3⌉⌉ groups with medians ≥ x and in which the "leftover" group does contribute 2 elements greater than x. Then the number of elements greater than x is exactly 2(⌈(1/2)⌈n/3⌉⌉ − 1) + 1 = 2⌈n/6⌉ − 1 (the −1 discounts x's own group, as usual, and the +1 is the extra element of the leftover group), which is roughly n/3, so that step 5 can recurse on roughly 2n/3 elements. The recurrence is then essentially

T(n) ≥ T(n/3) + T(2n/3) + Θ(n) ,

from which you can show that T(n) ≥ cn lg n by substitution. You can also see that T(n) is nonlinear by noticing that each level of the recursion tree sums to n.
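To make the substitution concrete, here is one way to carry it out (added here; floors and the small additive constants are ignored), with inductive hypothesis T(n) ≥ dn lg n:

T(n) ≥ T(n/3) + T(2n/3) + cn
     ≥ d(n/3) lg(n/3) + d(2n/3) lg(2n/3) + cn
     = dn lg n − dn (lg 3 − 2/3) + cn
     ≥ dn lg n        as long as d ≤ c/(lg 3 − 2/3) ,

so T(n) = Ω(n lg n).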
[In fact, any odd group size ≥ 5 works in linear time.]
Solution to Exercise 9.3-3
A modification to quicksort that allows it to run in O(n lg n) time in the worst case uses the deterministic PARTITION algorithm that was modified to take an element to partition around as an input parameter.
SELECT takes an array A, the bounds p and r of the subarray in A, and the rank i of an order statistic, and in time linear in the size of the subarray A[p..r] it returns the ith smallest element in A[p..r]. BEST-CASE-QUICKSORT calls SELECT to find the median of A[p..r], partitions around it, and recurses on the two sides of the partition; a sketch appears after the analysis below.
For an n-element array, the largest subarray that BEST-CASE-QUICKSORT recurses on has ⌊n/2⌋ elements. This situation occurs when n = r − p + 1 is even; then the subarray A[q + 1..r] has n/2 elements, and the subarray A[p..q − 1] has n/2 − 1 elements. The recurrence for the worst-case running time is thus T(n) ≤ 2T(n/2) + Θ(n) = O(n lg n).
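A Python sketch of this idea (the names best_case_quicksort and partition_around are assumptions of this sketch, and sorted() stands in for the linear-time SELECT):

def partition_around(A, p, r, x):
    """Modified PARTITION: partition A[p..r] around the given value x."""
    j = A.index(x, p, r + 1)
    A[j], A[r] = A[r], A[j]              # move the pivot to the end
    i = p - 1
    for k in range(p, r):
        if A[k] <= A[r]:
            i += 1
            A[i], A[k] = A[k], A[i]
    A[i + 1], A[r] = A[r], A[i + 1]
    return i + 1                         # final position of the pivot

def best_case_quicksort(A, p, r):
    """Quicksort that always pivots on the median of A[p..r]."""
    if p < r:
        n = r - p + 1
        x = sorted(A[p:r + 1])[(n - 1) // 2]   # lower median; stand-in for SELECT
        q = partition_around(A, p, r, x)
        best_case_quicksort(A, p, q - 1)
        best_case_quicksort(A, q + 1, r)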
Solution to Exercise 9.3-5

We assume that we are given a procedure MEDIAN that takes as parameters an array A and subarray indices p and r, and returns the value of the median element of A[p..r] in O(n) time in the worst case.
Given MEDIAN, here is a linear-time algorithm SELECT for finding the ith smallest element in A[p..r]. This algorithm uses the deterministic PARTITION algorithm that was modified to take an element to partition around as an input parameter. It finds the median x of A[p..r] with MEDIAN, partitions around x so that x lands in position q, and lets k = q − p + 1 be x's rank in the subarray; then

if i = k
    then return A[q]
elseif i < k
    then return SELECT(A, p, q − 1, i)
else return SELECT(A, q + 1, r, i − k)
Because x is the median of A[p..r], each of the subarrays A[p..q − 1] and A[q + 1..r] has at most half the number of elements of A[p..r]. The recurrence for the worst-case running time of SELECT is T(n) ≤ T(n/2) + O(n) = O(n).
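A Python sketch under the same assumption (median is the assumed black-box, with a sorted()-based stand-in; distinct elements assumed):

def median(xs):
    """Stand-in for the assumed O(n) black-box MEDIAN (lower median)."""
    return sorted(xs)[(len(xs) - 1) // 2]

def select_via_median(A, i):
    """i-th smallest (1-indexed) of A; linear time given a linear-time median()."""
    if len(A) == 1:
        return A[0]
    x = median(A)
    low = [a for a in A if a < x]        # at most half of A
    high = [a for a in A if a > x]       # at most half of A
    k = len(low) + 1                     # rank of x
    if i == k:
        return x
    elif i < k:
        return select_via_median(low, i)
    else:
        return select_via_median(high, i - k)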
Solution to Exercise 9.3-8
Let's start out by supposing that the median (the lower median, since we know we have an even number of elements) is in X. Let's call the median value m, and let's suppose that it's in X[k]. Then k elements of X are less than or equal to m and n − k elements of X are greater than or equal to m. We know that in the two arrays combined, there must be n elements less than or equal to m and n elements greater than or equal to m, and so there must be n − k elements of Y that are less than or equal to m and n − (n − k) = k elements of Y that are greater than or equal to m.
Thus, we can check that X[k] is the lower median by checking whether Y[n − k] ≤ X[k] ≤ Y[n − k + 1]. A boundary case occurs for k = n. Then n − k = 0, and there is no array entry Y[0]; we only need to check that X[n] ≤ Y[1].

Now, if the median is in X but is not in X[k], then the above condition will not hold. If the median is in X[k′], where k′ < k, then X[k] is above the median, and Y[n − k + 1] < X[k]. Conversely, if the median is in X[k′], where k′ > k, then X[k] is below the median, and X[k] < Y[n − k].
Thus, we can use a binary search to determine whether there is an X[k] such that either k < n and Y[n − k] ≤ X[k] ≤ Y[n − k + 1] or k = n and X[k] ≤ Y[n − k + 1]; if we find such an X[k], then it is the median. Otherwise, we know that the median is in Y, and we use a binary search to find a Y[k] such that either k < n and X[n − k] ≤ Y[k] ≤ X[n − k + 1] or k = n and Y[k] ≤ X[n − k + 1]; such a Y[k] is the median. Since each binary search takes O(lg n) time, we spend a total of O(lg n) time. The binary search can be written as the recursive procedure FIND-MEDIAN(A, B, n, low, high), first called with A = X, B = Y, low = 1, high = n:

if low > high
    then return NOT-FOUND
else k ← ⌊(low + high)/2⌋
     if k = n and A[n] ≤ B[1]
         then return A[n]
     elseif k < n and B[n − k] ≤ A[k] ≤ B[n − k + 1]
         then return A[k]
     elseif A[k] > B[n − k + 1]
         then return FIND-MEDIAN(A, B, n, low, k − 1)
     else return FIND-MEDIAN(A, B, n, k + 1, high)

If the result is NOT-FOUND, we call FIND-MEDIAN again with the arrays swapped.
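A runnable Python version of this search (0-indexed, so the 1-indexed conditions above shift down by one; two_array_median and find_in are names invented for this sketch):

def two_array_median(X, Y):
    """Lower median of the 2n elements in sorted n-arrays X and Y, in O(lg n)."""
    n = len(X)

    def find_in(A, B):
        low, high = 1, n
        while low <= high:
            k = (low + high) // 2
            if k == n and A[n - 1] <= B[0]:            # boundary case k = n
                return A[n - 1]
            if k < n and B[n - k - 1] <= A[k - 1] <= B[n - k]:
                return A[k - 1]                        # A[k] is the median
            if A[k - 1] > B[n - k]:
                high = k - 1                           # A[k] is above the median
            else:
                low = k + 1                            # A[k] is below the median
        return None                                    # median is not in A

    m = find_in(X, Y)
    return m if m is not None else find_in(Y, X)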
Solution to Exercise 9.3-9
In order to find the optimal placement for Professor Olay's pipeline, we need only find the median(s) of the y-coordinates of his oil wells, as the following proof explains.
Claim
The optimal y-coordinate for Professor Olay’s east-west oil pipeline is as follows:
• If n is even, then on either the oil well whose y-coordinate is the lower median
or the one whose y-coordinate is the upper median, or anywhere between them.
• If n is odd, then on the oil well whose y-coordinate is the median.
Proof We examine various cases. In each case, we will start out with the pipeline at a particular y-coordinate and see what happens when we move it. We'll denote by s the sum of the north-south spurs with the pipeline at the starting location, and s′ will denote the sum after moving the pipeline.

We start with the case in which n is even. Let us start with the pipeline somewhere on or between the two oil wells whose y-coordinates are the lower and upper medians. If we move the pipeline by a vertical distance d without crossing either of the median wells, then n/2 of the wells become d farther from the pipeline and n/2 become d closer, and so s′ = s + dn/2 − dn/2 = s; thus, all locations on or between the two medians are equally good.
Now suppose that the pipeline goes through the oil well whose y-coordinate is the upper median. What happens when we increase the y-coordinate of the pipeline by d > 0 units, so that it moves above the oil well that achieves the upper median? All oil wells whose y-coordinates are at or below the upper median become d units farther from the pipeline, and there are at least n/2 + 1 such oil wells (the upper median, and every well at or below the lower median). There are at most n/2 − 1 oil wells whose y-coordinates are above the upper median, and each of these oil wells becomes at most d units closer to the pipeline when it moves up. Thus, we have a lower bound on s′ of s′ ≥ s + d(n/2 + 1) − d(n/2 − 1) = s + 2d > s. We conclude that moving the pipeline up from the oil well at the upper median increases the total spur length. A symmetric argument shows that if we start with the pipeline going through the oil well whose y-coordinate is the lower median and move it down, then the total spur length increases.

We see, therefore, that when n is even, an optimal placement of the pipeline is anywhere on or between the two medians.
Now we consider the case when n is odd. We start with the pipeline going through the oil well whose y-coordinate is the median, and we consider what happens when we move it up by d > 0 units. All oil wells at or below the median become d units farther from the pipeline, and there are at least (n + 1)/2 such wells (the one at the median and the (n − 1)/2 at or below the median). There are at most (n − 1)/2 oil wells above the median, and each of these becomes at most d units closer to the pipeline. We get a lower bound on s′ of s′ ≥ s + d(n + 1)/2 − d(n − 1)/2 = s + d > s, and we conclude that moving the pipeline up from the oil well at the median increases the total spur length. A symmetric argument shows that moving the pipeline down from the median also increases the total spur length, and so the optimal placement of the pipeline is on the median. (claim)
Since we know we are looking for the median, we can use the linear-time median-finding algorithm, as in the sketch below.
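For illustration (not part of the original solution), assuming wells are given as (x, y) pairs:

def optimal_pipeline_y(wells):
    """Optimal east-west pipeline: the lower median of the y-coordinates.
    sorted() is an O(n lg n) stand-in for linear-time selection."""
    ys = sorted(y for (x, y) in wells)
    return ys[(len(ys) - 1) // 2]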
Solution to Problem 9-1
We assume that the numbers start out in an array.

a. Sort the numbers using merge sort or heapsort, which take Θ(n lg n) worst-case time. (Don't use quicksort or insertion sort, which can take Θ(n²) time.) Put the i largest elements (directly accessible in the sorted array) into the output array, taking Θ(i) time.

Total worst-case running time: Θ(n lg n + i) = Θ(n lg n) (because i ≤ n).
b. Implement the priority queue as a heap. Build the heap, which takes Θ(n) time, then call HEAP-EXTRACT-MAX i times to get the i largest elements, in Θ(i lg n) worst-case time, and store them in reverse order of extraction in the output array. The worst-case extraction time is Θ(i lg n) because
• i extractions from a heap with O(n) elements takes i · O(lg n) = O(i lg n) time, and
• half of the i extractions are from a heap with ≥ n/2 elements, so those i/2 extractions take (i/2) Ω(lg(n/2)) = Ω(i lg n) time in the worst case.

Total worst-case running time: Θ(n + i lg n).
c. Use an order-statistic algorithm to find the ith largest number in Θ(n) time. Partition around that number in Θ(n) time. Sort the i largest numbers in Θ(i lg i) worst-case time (with merge sort or heapsort).

Total worst-case running time: Θ(n + i lg i).

Note that method (c) is always asymptotically at least as good as the other two methods, and that method (b) is asymptotically at least as good as (a). (Comparing (c) to (b) is easy, but it is less obvious how to compare (c) and (b) to (a). (c) and (b) are asymptotically at least as good as (a) because n, i lg i, and i lg n are all O(n lg n). The sum of two things that are O(n lg n) is also O(n lg n).)
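A Python sketch of method (c) (illustrative only; sorted() stands in for the linear-time order-statistic algorithm):

def i_largest_sorted(A, i):
    """Select the i-th largest, partition around it, then sort only the
    i largest; O(n + i lg i) with a linear-time selection."""
    n = len(A)
    kth = sorted(A)[n - i]               # i-th largest; stand-in for SELECT
    top = [a for a in A if a > kth]      # partition: strictly above the pivot
    top += [kth] * (i - len(top))        # pad with pivot copies in case of ties
    return sorted(top)                   # only i elements are sorted here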
Solution to Problem 9-2
a. The median x of the elements x1, x2, ..., xn is an element x = xk satisfying |{xi : 1 ≤ i ≤ n and xi < x}| ≤ n/2 and |{xi : 1 ≤ i ≤ n and xi > x}| ≤ n/2. If each element xi is assigned a weight wi = 1/n, then we get

Σ_{xi<x} wi = Σ_{xi<x} 1/n = (1/n) · |{xi : 1 ≤ i ≤ n and xi < x}| ≤ (1/n) · (n/2) = 1/2 ,

and symmetrically Σ_{xi>x} wi ≤ 1/2, which proves that x is also the weighted median of x1, x2, ..., xn with weights wi = 1/n, for i = 1, 2, ..., n.
b. We first sort the n elements. Then we scan the array of sorted xi's, starting with the smallest element and accumulating weights as we scan, until the total exceeds 1/2. The last element, say xk, whose weight caused the total to exceed 1/2, is the weighted median. Notice that the total weight of all elements smaller than xk is less than 1/2, because xk was the first element that caused the total weight to exceed 1/2. Similarly, the total weight of all elements larger than xk is also less than 1/2, because the total weight of all the other elements exceeds 1/2.

The sorting phase can be done in O(n lg n) worst-case time (using merge sort or heapsort), and the scanning phase takes O(n) time. The total running time in the worst case, therefore, is O(n lg n).
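A Python sketch of this sort-and-scan method (added here for illustration; assumes positive weights that sum to 1):

def weighted_median_scan(xs, ws):
    """Sort the elements, then accumulate weights until the total exceeds 1/2."""
    total = 0.0
    for x, w in sorted(zip(xs, ws)):     # O(n lg n) sorting phase
        total += w
        if total > 0.5:                  # first element pushing the total past 1/2
            return x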
c. We find the weighted median in Θ(n) worst-case time using the Θ(n) worst-case median algorithm in Section 9.3. (Although the first paragraph of the section only claims an O(n) upper bound, it is easy to see that the more precise running time of Θ(n) applies as well, since steps 1, 2, and 4 of SELECT actually take Θ(n) time.)

The weighted-median algorithm works as follows. If n ≤ 2, we just return the brute-force solution. Otherwise, we proceed as follows. We find the actual median xk of the n elements and then partition around it. We then compute the total weights of the two halves. If the weights of the two halves are each strictly less than 1/2, then the weighted median is xk. Otherwise, the weighted median should be in the half with total weight exceeding 1/2. The total weight of the "light" half is lumped into the weight of xk, and the search continues within the half that weighs more than 1/2. A sketch of this procedure, which takes as input the set {x1, x2, ..., xn} with its weights, appears below.
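The original pseudocode was lost to the page break; here is a Python sketch under the same assumptions ((x, w) pairs with total weight 1, and sorted() standing in for the linear-time median step):

def weighted_median(pairs):
    """Weighted (lower) median of [(x, w), ...] with total weight 1.
    With a linear-time median instead of sorted(), this runs in Θ(n)."""
    if len(pairs) <= 2:                  # brute-force base case
        pairs = sorted(pairs)
        if len(pairs) == 1 or pairs[0][1] >= 0.5:
            return pairs[0][0]
        return pairs[1][0]
    xs = sorted(p[0] for p in pairs)
    xk = xs[(len(xs) - 1) // 2]          # actual (lower) median of the x's
    low = [p for p in pairs if p[0] < xk]
    high = [p for p in pairs if p[0] > xk]
    wl = sum(w for _, w in low)          # total weight of each half
    wh = sum(w for _, w in high)
    if wl < 0.5 and wh <= 0.5:           # xk satisfies the weighted-median definition
        return xk
    if wl >= 0.5:                        # lump the remaining weight into xk
        return weighted_median(low + [(xk, 1 - wl)])
    return weighted_median(high + [(xk, 1 - wh)])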
The recurrence for the worst-case running time of WEIGHTED-MEDIAN is T(n) = T(n/2 + 1) + Θ(n), since there is at most one recursive call on half the number of elements, plus the median element xk, and all the work preceding the recursive call takes Θ(n) time. The solution of the recurrence is T(n) = Θ(n).
d. Let the n points be denoted by their coordinates x1, x2, ..., xn, let the corresponding weights be w1, w2, ..., wn, and let x = xk be the weighted median. For any point p, let f(p) = Σ_{i=1}^{n} wi |p − xi|; we want to find a point p such that f(p) is minimum. Let y be any point (real number) other than x. We show the optimality of the weighted median x by showing that f(y) − f(x) ≥ 0. We examine separately the cases in which y > x and x > y. For any x and y, we have

f(y) − f(x) = Σ_{i=1}^{n} wi (|y − xi| − |x − xi|) .

When y > x, we bound the quantity |y − xi| − |x − xi| from below by examining three cases:
1. x < y ≤ xi: Here, |x − y| + |y − xi| = |x − xi| and |x − y| = y − x, which imply |y − xi| − |x − xi| = −|x − y| = x − y.
2. x < xi ≤ y: Here, |y − xi| ≥ 0 and |x − xi| ≤ y − x, which imply |y − xi| − |x − xi| ≥ −(y − x) = x − y.
3. xi ≤ x < y: Here, |x − xi| + |y − x| = |y − xi| and |y − x| = y − x, which imply |y − xi| − |x − xi| = |y − x| = y − x.

Separating out the first two cases, in which x < xi, from the third case, in which x ≥ xi, we get

f(y) − f(x) ≥ Σ_{x<xi} wi (x − y) + Σ_{x≥xi} wi (y − x)
            = (y − x) (Σ_{x≥xi} wi − Σ_{x<xi} wi) .

The property that Σ_{xi<x} wi < 1/2 implies that Σ_{x≥xi} wi ≥ 1/2. This fact, combined with y − x > 0 and Σ_{x<xi} wi ≤ 1/2, yields that f(y) − f(x) ≥ 0.

When x > y, we again bound the quantity |y − xi| − |x − xi| from below by examining three cases:
1. xi ≤ y < x: Here, |y − xi| + |x − y| = |x − xi| and |x − y| = x − y, which imply |y − xi| − |x − xi| = −|x − y| = y − x.
2. y ≤ xi < x: Here, |y − xi| ≥ 0 and |x − xi| ≤ x − y, which imply |y − xi| − |x − xi| ≥ −(x − y) = y − x.
3. y < x ≤ xi: Here, |y − x| + |x − xi| = |y − xi| and |y − x| = x − y, which imply |y − xi| − |x − xi| = |y − x| = x − y.

Separating out the first two cases, in which x > xi, from the third case, in which x ≤ xi, we get

f(y) − f(x) ≥ Σ_{x>xi} wi (y − x) + Σ_{x≤xi} wi (x − y)
            = (x − y) (Σ_{x≤xi} wi − Σ_{x>xi} wi) .

The property that Σ_{xi>x} wi ≤ 1/2 implies that Σ_{x≤xi} wi ≥ 1/2. This fact, combined with x − y > 0 and Σ_{x>xi} wi < 1/2, yields that f(y) − f(x) > 0.
e. We are given n 2-dimensional points p1, p2, ..., pn, where each pi is a pair of real numbers pi = (xi, yi), and positive weights w1, w2, ..., wn. The goal is to find a point p = (x, y) that minimizes the sum

f(x, y) = Σ_{i=1}^{n} wi (|x − xi| + |y − yi|) .

We can express the cost function of the two variables, f(x, y), as the sum of two functions of one variable each: f(x, y) = g(x) + h(y), where g(x) = Σ_{i=1}^{n} wi |x − xi| and h(y) = Σ_{i=1}^{n} wi |y − yi|. The goal of finding a point p = (x, y) that minimizes the value of f(x, y) can be achieved by treating each dimension independently, because g does not depend on y and h does not depend on x. Thus,

min_{x,y} f(x, y) = min_{x,y} (g(x) + h(y))
                  = min_x (g(x) + min_y h(y))
                  = min_x g(x) + min_y h(y) .
Consequently, finding the best location in 2 dimensions can be done by finding the weighted median xk of the x-coordinates and then finding the weighted median yj of the y-coordinates. The point (xk, yj) is an optimal solution for the 2-dimensional post-office location problem.
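For illustration, reusing the weighted_median sketch from part (c):

def post_office_2d(points, weights):
    """Optimal location under weighted Manhattan distance: the weighted
    median of each coordinate, treated independently."""
    xw = [(x, w) for (x, y), w in zip(points, weights)]
    yw = [(y, w) for (x, y), w in zip(points, weights)]
    return (weighted_median(xw), weighted_median(yw))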
Solution to Problem 9-3
a. Our algorithm relies on a particular property of SELECT: that not only does it return the ith smallest element, but that it also partitions the input array so that the first i positions contain the i smallest elements (though not necessarily in sorted order). To see that SELECT has this property, observe that there are only two ways in which it returns a value: when n = 1, and when, immediately after partitioning in step 4, it finds that there are exactly i elements on the low side of the partition.
Taking the hint from the book, here is our modified algorithm to select the ith smallest element of n elements. Whenever it is called with i ≥ n/2, it just calls SELECT and returns its result; in this case, Ui(n) = T(n).

When i < n/2, our modified algorithm works as follows. Assume that the input is in a subarray A[p + 1..p + n], and let m = ⌊n/2⌋.
1. Divide the input as follows. If n is even, divide the input into two parts: A[p + 1..p + m] and A[p + m + 1..p + n]. If n is odd, divide the input into three parts: A[p + 1..p + m], A[p + m + 1..p + n − 1], and A[p + n] as a leftover piece.
2. Compare A[p + i] and A[p + i + m] for i = 1, 2, ..., m, putting the smaller of the two elements into A[p + i + m] and the larger into A[p + i].
3. Recursively find the ith smallest element in A[p + m + 1..p + n], but with an additional action performed by the partitioning procedure: whenever it exchanges A[j] and A[k] (where p + m + 1 ≤ j, k ≤ p + 2m), it also exchanges A[j − m] and A[k − m]. The idea is that after recursively finding the ith smallest element in A[p + m + 1..p + n], the subarray A[p + m + 1..p + m + i] contains the i smallest elements that had been in A[p + m + 1..p + n] and the subarray A[p + 1..p + i] contains their larger counterparts, as found in step 1. The ith smallest element of A[p + 1..p + n] must be either one of the i smallest, as placed into A[p + m + 1..p + m + i], or it must be one of the larger counterparts, as placed into A[p + 1..p + i].
4. Collect the subarrays A[p + 1..p + i] and A[p + m + 1..p + m + i] into a single array B[1..2i], call SELECT to find the ith smallest element of B, and return the result of this call to SELECT.
The number of comparisons in each step is as follows:
1. Step 1 performs no comparisons.
2. Step 2 performs m = ⌊n/2⌋ comparisons.
3. Step 3 performs Ui(⌈n/2⌉) comparisons, since it recurses on the ⌈n/2⌉ elements of A[p + m + 1..p + n] (the extra exchanges perform no comparisons).
4. Step 4 performs T(2i) comparisons, since B has 2i elements.
Thus, when i < n/2, the total number of comparisons is Ui(n) = ⌊n/2⌋ + Ui(⌈n/2⌉) + T(2i).
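The following Python sketch (an illustration added here, not the manual's code) captures the idea: it replaces the in-place mirrored exchanges of step 3 with explicit index bookkeeping, and replaces the recursive call with a direct selection, so it shows the candidate-set argument rather than the exact comparison count:

def nth_smallest(vals, i):
    """Stand-in for the worst-case linear-time SELECT (1-indexed)."""
    return sorted(vals)[i - 1]

def select_few(A, i):
    """Sketch for i < n/2: pair the elements, take the i smallest of the
    'small' half, and select among them and their larger partners."""
    n = len(A)
    if n <= 2 or i >= n / 2:
        return nth_smallest(A, i)
    m = n // 2
    low = [min(A[j], A[j + m]) for j in range(m)]    # step 2: smaller of each pair
    high = [max(A[j], A[j + m]) for j in range(m)]   # ... and its larger partner
    half = low + A[2 * m:]               # leftover element joins the small half
    # Step 3, with the recursion replaced by a direct selection of the
    # positions of the i smallest elements of the small half:
    order = sorted(range(len(half)), key=lambda j: half[j])[:i]
    # Step 4: 2i candidates -- the i smallest, plus their larger partners.
    B = [half[j] for j in order] + [high[j] for j in order if j < m]
    return nth_smallest(B, i)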
b. We show by substitution that if i < n/2, then Ui(n) = n + O(T(2i) lg(n/i)). In particular, we shall show that Ui(n) ≤ n + cT(2i) lg(n/i) − d(lg lg n)T(2i) = n + cT(2i) lg n − cT(2i) lg i − d(lg lg n)T(2i) for some positive constant c, some positive constant d to be chosen later, and n ≥ 4. We have

Ui(n) ≤ n + cT(2i) lg(n/2) − cT(2i) lg i − d(lg lg(n/2))T(2i)
      ≤ n + cT(2i) lg(n/2 + 1) − cT(2i) lg i − d(lg lg(n/2))T(2i)
      = n + cT(2i) lg(n/2 + 1) − cT(2i) lg i − d(lg(lg n − 1))T(2i)
      ≤ n + cT(2i) lg n − cT(2i) lg i − d(lg lg n)T(2i)

if cT(2i) lg(n/2 + 1) − d(lg(lg n − 1))T(2i) ≤ cT(2i) lg n − d(lg lg n)T(2i).
Simple algebraic manipulation gives the following sequence of equivalent conditions:

cT(2i) lg(n/2 + 1) − d(lg(lg n − 1))T(2i) ≤ cT(2i) lg n − d(lg lg n)T(2i)
c lg(n/2 + 1) − d lg(lg n − 1) ≤ c lg n − d lg lg n
c (lg(n/2 + 1) − lg n) ≤ d (lg(lg n − 1) − lg lg n)
c lg((n/2 + 1)/n) ≤ d lg((lg n − 1)/lg n)
c lg(1/2 + 1/n) ≤ d lg((lg n − 1)/lg n)

Observe that 1/2 + 1/n decreases as n increases, but (lg n − 1)/lg n increases as n increases. When n = 4, we have 1/2 + 1/n = 3/4 and (lg n − 1)/lg n = 1/2. Thus, we just need to choose d such that c lg(3/4) ≤ d lg(1/2) or, equivalently, c lg(3/4) ≤ −d. Multiplying both sides by −1, we get d ≤ −c lg(3/4) = c lg(4/3). Thus, any value of d that is at most c lg(4/3) suffices.
c. When i is a constant less than n/2, we have that

Ui(n) = n + O(T(2i) lg(n/i)) = n + O(O(1) · lg n) = n + O(lg n) ,

since T(2i) = O(1) when i is a constant.
Lecture Notes for Chapter 11: Hash Tables
Chapter 11 overview
Many applications require a dynamic set that supports only the dictionary operations INSERT, SEARCH, and DELETE.
A hash table is effective for implementing a dictionary.
• The expected time to search for an element in a hash table is O(1), under some reasonable assumptions.
• Worst-case search time is Θ(n), however.
A hash table is a generalization of an ordinary array.
• With an ordinary array, we store the element whose key is k in position k of the array.
• Given a key k, we find the element whose key is k by just looking in the kth position of the array. This is called direct addressing.
• Direct addressing is applicable when we can afford to allocate an array with one position for every possible key.

We use a hash table when we do not want to (or cannot) allocate an array with one position per possible key.
• Use a hash table when the number of keys actually stored is small relative to the number of possible keys.
• A hash table is an array, but it typically uses a size proportional to the number of keys to be stored (rather than the number of possible keys).
• Given a key k, don't just use k as the index into the array.
• Instead, compute a function of k, and use that value to index into the array. We call this function a hash function.
Issues that we’ll explore in hash tables:
• How to compute hash functions. We'll look at the multiplication and division methods.
• What to do when the hash function maps multiple keys to the same table entry. We'll look at chaining and open addressing.
Direct-address tables
Scenario:
• Maintain a dynamic set.
• Each element has a key drawn from a universe U = {0, 1, ..., m − 1}, where m isn't too large.
• No two elements have the same key.
Represent by a direct-address table, or array, T[0..m − 1]:
• Each slot, or position, corresponds to a key in U.
• If there's an element x with key k, then T[k] contains a pointer to x.
• Otherwise, T[k] is empty, represented by NIL.
[Figure: a direct-address table T[0..9]. The keys in use are 2, 3, 5, and 8; slots 2, 3, 5, and 8 each point to an element holding a key and satellite data, and all other slots contain NIL.]
Dictionary operations are trivial and take O(1) time each:

DIRECT-ADDRESS-SEARCH(T, k)
    return T[k]

DIRECT-ADDRESS-INSERT(T, x)
    T[key[x]] ← x

DIRECT-ADDRESS-DELETE(T, x)
    T[key[x]] ← NIL
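As a concrete illustration (not in the original notes), a minimal Python rendering; the Element type and its field names are assumptions of this sketch:

from collections import namedtuple

Element = namedtuple('Element', ['key', 'data'])   # key in {0, ..., m-1}

class DirectAddressTable:
    def __init__(self, m):
        self.T = [None] * m              # one slot per possible key

    def search(self, k):                 # O(1)
        return self.T[k]

    def insert(self, x):                 # O(1)
        self.T[x.key] = x

    def delete(self, x):                 # O(1)
        self.T[x.key] = None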
The problem with direct addressing is that if the universe U is large, storing a table of size |U| may be impractical or impossible.

Often, the set K of keys actually stored is small compared to U, so that most of the space allocated for T is wasted.
Hash tables

• When K is much smaller than U, a hash table requires much less space than a direct-address table.
• Can reduce storage requirements to Θ(|K|).
• Can still get O(1) search time, but in the average case, not the worst case.
Idea: Instead of storing an element with key k in slot k, use a function h and store the element in slot h(k).
• We call h a hash function.
• h : U → {0, 1, ..., m − 1}, so that h(k) is a legal slot number in T.
• We say that k hashes to slot h(k).
Collisions: When two or more keys hash to the same slot.
• Can happen when there are more possible keys than slots (|U| > m).
• For a given set K of keys with |K| ≤ m, may or may not happen. Definitely happens if |K| > m.
• Therefore, must be prepared to handle collisions in all cases.
• Use two methods: chaining and open addressing.
• Chaining is usually better than open addressing. We'll examine both.
Collision resolution by chaining

Put all elements that hash to the same slot into a linked list.
• Slot j contains a pointer to the head of the list of all stored elements that hash to j [or to the sentinel if using a circular, doubly linked list with a sentinel].
• If there are no such elements, slot j contains NIL.
How to implement dictionary operations with chaining:

CHAINED-HASH-INSERT(T, x)
    insert x at the head of list T[h(key[x])]

• Worst-case running time is O(1).
• Assumes that the element being inserted isn't already in the list.
• It would take an additional search to check if it was already inserted.

CHAINED-HASH-SEARCH(T, k)
    search for an element with key k in list T[h(k)]

Running time is proportional to the length of the list of elements in slot h(k).

CHAINED-HASH-DELETE(T, x)
    delete x from the list T[h(key[x])]

• Given pointer x to the element to delete, so no search is needed to find this element.
• Worst-case running time is O(1) time if the lists are doubly linked.
• If the lists are singly linked, then deletion takes as long as searching, because we must find x's predecessor in its list in order to correctly update next pointers.
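A minimal Python sketch of chaining (not from the notes); Python lists stand in for the linked lists, so DELETE here is not the O(1) doubly-linked-list deletion described above:

class ChainedHashTable:
    def __init__(self, m):
        self.m = m
        self.T = [[] for _ in range(m)]  # m initially empty chains

    def _h(self, k):
        return hash(k) % self.m          # stand-in hash function

    def insert(self, k, v):              # O(1): insert at the head of the chain
        self.T[self._h(k)].insert(0, (k, v))

    def search(self, k):                 # time proportional to the chain length
        for key, val in self.T[self._h(k)]:
            if key == k:
                return val
        return None

    def delete(self, k):                 # not O(1): Python lists are not
        chain = self.T[self._h(k)]       # doubly linked lists
        for j, (key, _) in enumerate(chain):
            if key == k:
                del chain[j]
                return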
Analysis of hashing with chaining
Given a key, how long does it take to find an element with that key, or to determine that there is no element with that key?
• Analysis is in terms of the load factor α = n/m:
• n = # of elements in the table.
• m = # of slots in the table = # of (possibly empty) linked lists.
• Load factor is the average number of elements per linked list.
• Can have α < 1, α = 1, or α > 1.
• Worst case is when all n keys hash to the same slot ⇒ get a single list of length n ⇒ worst-case time to search is Θ(n), plus time to compute the hash function.
• Average case depends on how well the hash function distributes the keys among the slots.

We focus on average-case performance of hashing with chaining.
• Assume simple uniform hashing: any given element is equally likely to hash into any of the m slots.
• For j = 0, 1, ..., m − 1, denote the length of list T[j] by nj. Then n = n0 + n1 + ⋯ + n_{m−1}.
• Average value of nj is E[nj] = α = n/m.
• Assume that we can compute the hash function in O(1) time, so that the time required to search for the element with key k depends on the length n_{h(k)} of the list T[h(k)].
We consider two cases:
• If the hash table contains no element with key k, then the search is unsuccessful.
• If the hash table does contain an element with key k, then the search is successful.

Theorem
An unsuccessful search takes expected time Θ(1 + α).

Proof Any key k not already in the table is equally likely to hash to any of the m slots. To search unsuccessfully for key k, we need to search to the end of the list T[h(k)]. This list has expected length E[n_{h(k)}] = α. Therefore, the expected number of elements examined in an unsuccessful search is α.

Adding in the time to compute the hash function, the total time required is Θ(1 + α).
Successful search:
• The expected time for a successful search is also Θ(1 + α).
• The circumstances are slightly different from an unsuccessful search.
• The probability that each list is searched is proportional to the number of elements it contains.
ele-Theorem
A successful search takes expected time Θ(1 + α).
Proof Assume that the element x being searched for is equally likely to be any of the n elements stored in the table.

The number of elements examined during a successful search for x is 1 more than the number of elements that appear before x in x's list. These are the elements inserted after x was inserted (because we insert at the head of the list).

So we need to find the average, over the n elements x in the table, of how many elements were inserted into x's list after x was inserted.

For i = 1, 2, ..., n, let xi be the ith element inserted into the table, and let ki = key[xi].
Simple uniform hashing ⇒ Pr{h(k i ) = h(k j )} = 1/m ⇒ E [X i j] = 1/m (by
Alternative analysis, using indicator random variables even more:
For each slot l and for each pair of keys k i and k j, deÞne the indicator random
variable X i jl = I {the search is for x i , h (k i ) = l, and h(k j ) = l} X i jl = 1 when
keys k i and k j collide at slot l and when we are searching for x i
Simple uniform hashing ⇒ Pr{h(k i ) = l} = 1/m and Pr {h(k j ) = l} = 1/m.
Also have Pr{the search is for x i } = 1/n These events are all independent ⇒
Pr{X i jl = 1} = 1/nm2⇒ E [X i jl]= 1/nm2(by Lemma 5.1)
Define, for each element xj, the indicator random variable

Yj = I{xj appears in a list prior to the element being searched for} .

Yj = 1 if and only if there is some slot l that has both elements xi and xj in its list, and also i < j (so that xi appears after xj in the list). Therefore,