MIT Introduction to Algorithms Record-8

Source: Internet. Published by: 卡洛斯实况巅峰数据. Editor: 程序博客网. Date: 2024/04/30 11:22

Weakness of hashing: For any fixed choice of hash function, there always exists a bad set of keys that all hash to the same slot.
Idea: Choose the hash function at random, independently of the keys.

Universal Hashing

Definition 1: Let $U$ be a universe of keys, and let $H$ be a finite collection of hash functions, each mapping $U$ to $\{0, 1, \ldots, m-1\}$.

$H$ is universal if, for all $x, y \in U$ with $x \neq y$,

$$|\{h \in H : h(x) = h(y)\}| = \frac{|H|}{m}.$$

I.e., if $h$ is chosen uniformly at random from $H$, the probability of a collision between $x$ and $y$ is $\frac{1}{m}$.

Theorem 1:
Choose $h$ uniformly at random from $H$, and suppose we hash $n$ keys into the $m$ slots of table $T$. Then, for a given key $x$, the expected number of collisions with $x$ is

$$E[\#\text{collisions with } x] < \frac{n}{m}.$$

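Proof:
Let $C_x$ be the random variable denoting the total number of keys in $T$ that collide with $x$, and let

$$c_{xy} = \begin{cases} 1 & \text{if } h(x) = h(y), \\ 0 & \text{otherwise.} \end{cases}$$

Note:

$E[c_{xy}] = \frac{1}{m}$ and $C_x = \sum_{y \in T - \{x\}} c_{xy}$, where $y$ ranges over the keys in table $T$ other than $x$.

So, by linearity of expectation:

$$E[C_x] = E\Big[\sum_{y \in T - \{x\}} c_{xy}\Big] = \sum_{y \in T - \{x\}} E[c_{xy}] = \sum_{y \in T - \{x\}} \frac{1}{m} = \frac{n-1}{m} < \frac{n}{m}$$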
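As a sanity check on Theorem 1, the following minimal sketch averages the number of collisions with $x$ over every function in a small universal family. It assumes the dot-product family that these notes construct under Definition 2, with keys already written as tuples of base-$m$ digits; the function name `expected_collisions` and the sample table are made up for illustration.

```python
from fractions import Fraction
from itertools import product

def expected_collisions(x, table, m):
    """Average, over every vector a in {0..m-1}^(r+1), of the number of
    keys y != x in `table` with h_a(y) == h_a(x). Keys are digit tuples."""
    def h(a, k):
        return sum(ai * ki for ai, ki in zip(a, k)) % m

    total = 0
    count = 0
    for a in product(range(m), repeat=len(x)):
        total += sum(1 for y in table if y != x and h(a, y) == h(a, x))
        count += 1
    return Fraction(total, count)

m = 5  # prime, as the construction requires
table = [(0, 1), (1, 0), (2, 3), (4, 4), (3, 1), (1, 2)]  # n = 6 distinct keys
x = table[0]
print(expected_collisions(x, table, m))  # 1, i.e. (n - 1)/m = 5/5
```

The exact average equals $(n-1)/m$, which is strictly below the bound $n/m$ stated in the theorem.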

Constructing a universal hash function

Let $m$ be prime. Decompose key $k$ into $r+1$ digits, so

$$k = \langle k_0, k_1, \ldots, k_r \rangle, \quad \text{where } 0 \leq k_i \leq m-1.$$

Here we treat $k$ as an $(r+1)$-digit base-$m$ number.
Now pick a vector $a$ at random, which we also view as a base-$m$ number: $a = \langle a_0, a_1, \ldots, a_r \rangle$, where each $a_i$ is chosen independently and uniformly at random from $\{0, 1, \ldots, m-1\}$. The hash function is then defined as follows:

Definition 2:

$$h_a(k) = \Big(\sum_{i=0}^{r} a_i k_i\Big) \bmod m$$
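Definition 2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: keys are non-negative integers that fit in $r+1$ base-$m$ digits, digits are taken least significant first, and the helper names `to_digits` and `make_hash` are hypothetical.

```python
import random

def to_digits(k, m, r):
    """Write integer k as r+1 base-m digits, least significant digit first."""
    digits = []
    for _ in range(r + 1):
        digits.append(k % m)
        k //= m
    if k != 0:
        raise ValueError("key does not fit in r+1 base-m digits")
    return digits

def make_hash(m, r, rng=random):
    """Draw a = <a_0, ..., a_r> uniformly at random and return (a, h_a).
    m is assumed to be prime, as the construction requires."""
    a = [rng.randrange(m) for _ in range(r + 1)]

    def h(k):
        return sum(ai * ki for ai, ki in zip(a, to_digits(k, m, r))) % m

    return a, h

a, h = make_hash(m=7, r=2)
print(a, h(123))  # a random vector and the slot (0..6) that key 123 maps to
```

Drawing a fresh `a` gives a different member of the family; the keys themselves never influence the draw.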

How big is this set of hash functions, i.e., how many different hash functions does $H = \{h_a\}$ contain?
Conclusion:

$$|H| = m^{r+1}$$

Explanation:
There are $m$ choices for each $a_i$ ($0 \leq a_i \leq m-1$), and the vector $a = \langle a_0, a_1, \ldots, a_r \rangle$ has $r+1$ elements.

Theorem2: H is universal.

Proof:
Let $x = \langle x_0, x_1, \ldots, x_r \rangle$ and $y = \langle y_0, y_1, \ldots, y_r \rangle$ be distinct keys, so they differ in at least one digit. They could differ in any of the digit positions; without loss of generality, assume they differ in position 0.

Question: For how many hash functions $h_a \in H$ do $x$ and $y$ collide?

If they collide, we must have $h_a(x) = h_a(y)$.

$$\Rightarrow \Big(\sum_{i=0}^{r} a_i x_i\Big) \bmod m = \Big(\sum_{i=0}^{r} a_i y_i\Big) \bmod m$$

$$\Rightarrow \sum_{i=0}^{r} a_i x_i \equiv \sum_{i=0}^{r} a_i y_i \pmod{m}$$

(the reduction modulo $m$ is applied after everything else has been computed)

$$\Rightarrow \sum_{i=0}^{r} a_i (x_i - y_i) \equiv 0 \pmod{m}$$

$$\Rightarrow a_0 (x_0 - y_0) + \sum_{i=1}^{r} a_i (x_i - y_i) \equiv 0 \pmod{m}$$

$$\Rightarrow a_0 (x_0 - y_0) \equiv -\sum_{i=1}^{r} a_i (x_i - y_i) \pmod{m}$$

Since $x_0 \neq y_0$, we have $x_0 - y_0 \not\equiv 0 \pmod{m}$, so by the number theory fact stated below the inverse $(x_0 - y_0)^{-1}$ exists, and

$$a_0 \equiv \Big(-\sum_{i=1}^{r} a_i (x_i - y_i)\Big) (x_0 - y_0)^{-1} \pmod{m}$$

Number Theory Fact:

Let $m$ be prime, and let $\mathbb{Z}_m$ denote the integers modulo $m$. For any $z \in \mathbb{Z}_m$ that is not congruent to 0, there exists a unique inverse $z^{-1} \in \mathbb{Z}_m$ such that multiplying $z$ by its inverse gives something congruent to 1 modulo $m$.
I.e.,
$$z \not\equiv 0 \Rightarrow \exists \text{ unique } z^{-1} \in \mathbb{Z}_m \text{ such that } z \cdot z^{-1} \equiv 1 \pmod{m}.$$
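The fact is easy to check numerically. The sketch below computes the inverse with Fermat's little theorem ($z^{m-2} \bmod m$ for prime $m$), which is one standard way to obtain it and is not taken from these notes; the name `mod_inverse` is hypothetical.

```python
def mod_inverse(z, m):
    """Inverse of z modulo a prime m, via Fermat's little theorem:
    z^(m-1) = 1 (mod m), hence z^(m-2) is the inverse of z."""
    if z % m == 0:
        raise ValueError("0 has no inverse modulo m")
    return pow(z, m - 2, m)

m = 7
for z in range(1, m):
    inv = mod_inverse(z, m)
    assert (z * inv) % m == 1  # z * z^{-1} is congruent to 1 mod m
    print(z, inv)
```

If $m$ were not prime, some nonzero $z$ would have no inverse at all (for example $z = 2$ with $m = 4$), which is why the construction insists on a prime $m$.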

Conclusion:
Thus, for any choice of $a_1, a_2, \ldots, a_r$, exactly 1 of the $m$ choices for $a_0$ causes $x$ and $y$ to collide, and the other $m-1$ choices for $a_0$ cause no collision.
So the number of $h_a$'s that cause $x$ and $y$ to collide is

$$\underbrace{m \cdot m \cdots m}_{r \text{ factors}} \cdot 1 = m^r = \frac{|H|}{m},$$

because $a_1$ has $m$ choices, $a_2$ has $m$ choices, and so on, but $a_0$ has only 1 choice if $x$ and $y$ are to collide.
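For small parameters the count can be verified by brute force. The sketch below enumerates every vector $a$ and counts those for which two fixed keys collide; the keys are given directly as digit tuples, and the name `count_colliding` and the sample keys are chosen only for illustration.

```python
from itertools import product

def count_colliding(x, y, m):
    """Count vectors a in {0..m-1}^(r+1) with h_a(x) == h_a(y),
    where x and y are tuples of r+1 base-m digits."""
    def h(a, k):
        return sum(ai * ki for ai, ki in zip(a, k)) % m

    return sum(1 for a in product(range(m), repeat=len(x)) if h(a, x) == h(a, y))

m = 7                      # prime
x, y = (1, 2, 3), (4, 2, 6)  # distinct keys with r + 1 = 3 digits
print(count_colliding(x, y, m), m ** (len(x) - 1))  # 49 49
```

Out of $|H| = 7^3 = 343$ functions, exactly $7^2 = 49 = |H|/m$ make the two keys collide, matching the conclusion.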

Perfect Hashing

Situation: Given $n$ keys, construct a static hash table of size $m = O(n)$ such that search takes $O(1)$ time in the worst case.

Idea: Use a two-level scheme with universal hashing at both levels. We arrange things so that there are no collisions at level 2, and we accept whatever collisions occur at level 1.
If $n_i$ items hash to level-1 slot $i$, then use $m_i = n_i^2$ slots in the level-2 table $s_i$.

Level-2 Analysis

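Theorem: Hash $n$ keys into $m = n^2$ slots using a random $h$ from a universal family $H$. Then

$$E[\#\text{collisions}] < \frac{1}{2}.$$

Proof: The probability that 2 given keys collide under $h$ is $\frac{1}{m} = \frac{1}{n^2}$. There are $\binom{n}{2}$ pairs of keys, so

$$E[\#\text{collisions}] = \binom{n}{2} \cdot \frac{1}{n^2} = \frac{n(n-1)}{2} \cdot \frac{1}{n^2} < \frac{1}{2}.$$

Note: $\binom{n}{2} = C_n^2$ is the number of ways to choose 2 of the $n$ keys.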
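As a quick worked instance of the bound (the value $n = 10$ is picked arbitrarily for illustration), hashing 10 keys into $m = 10^2 = 100$ slots gives

$$E[\#\text{collisions}] = \binom{10}{2} \cdot \frac{1}{100} = \frac{45}{100} = 0.45 < \frac{1}{2}.$$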

Markov inequality

For a random variable $X \geq 0$ and any $t > 0$, $\Pr\{X \geq t\} \leq \frac{E[X]}{t}$.

Proof:

$$E[X] = \sum_{x=0}^{\infty} x \cdot \Pr\{X = x\} \geq \sum_{x=t}^{\infty} x \cdot \Pr\{X = x\}$$

$$\geq \sum_{x=t}^{\infty} t \cdot \Pr\{X = x\} = t \cdot \Pr\{X \geq t\}$$

Corollary:

$$\Pr\{\text{no collisions}\} \geq \frac{1}{2}$$

We can now use the Markov inequality to prove the corollary.

Proof:

$$\Pr\{\geq 1 \text{ collision}\} \leq \frac{E[\#\text{collisions}]}{1} < \frac{1}{2}$$

$$\Rightarrow \Pr\{\text{no collisions}\} = 1 - \Pr\{\geq 1 \text{ collision}\} \geq \frac{1}{2}$$

Conclusion: To find a good level-2 hash function, just test a few at random; we will find one quickly, since at least half of them work.
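The "test a few at random" step can be sketched as a retry loop. This is a minimal illustration with assumptions that go beyond the notes: it reuses the dot-product family from Definition 2, so the table size is rounded up to the smallest prime $m \geq n^2$ (which only lowers the collision probability), keys are distinct non-negative integers below `key_bound`, and all function names are hypothetical.

```python
import random

def next_prime(q):
    """Smallest prime >= q, by trial division (fine for small tables)."""
    q = max(q, 2)
    while any(q % d == 0 for d in range(2, int(q ** 0.5) + 1)):
        q += 1
    return q

def hash_with(a, k, m):
    """Dot product of a with the base-m digits of k (least significant first), mod m."""
    total = 0
    for ai in a:
        total += ai * (k % m)
        k //= m
    return total % m

def find_collision_free(keys, key_bound, rng=random):
    """Return (m, a, tries): a vector a that hashes the distinct keys
    into m >= n^2 slots without any collision."""
    n = len(keys)
    m = next_prime(n * n)
    digits = 1
    while m ** digits < key_bound:  # enough digits to represent every key
        digits += 1
    tries = 0
    while True:
        tries += 1
        a = [rng.randrange(m) for _ in range(digits)]
        if len({hash_with(a, k, m) for k in keys}) == n:
            return m, a, tries

keys = [3, 17, 42, 99, 256, 1000]
m, a, tries = find_collision_free(keys, key_bound=1024)
print(m, a, tries)  # m is 37; a and tries vary from run to run
```

Since each draw succeeds with probability at least $\frac{1}{2}$, the expected number of tries is at most 2.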

Analysis of storage

For level 1, choose $m = n$, and let $n_i$ be the random variable for the number of keys that hash to slot $i$ in table $T$. Use $m_i = n_i^2$ slots in each level-2 table $s_i$, so

$$E[\text{total storage}] = n + E\Big[\sum_{i=0}^{m-1} \Theta(n_i^2)\Big] = \Theta(n)$$

Note: $E\big[\sum_{i=0}^{m-1} \Theta(n_i^2)\big] = \Theta(n)$ follows from the same analysis used for bucket sort.
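Putting the two levels together, the following is a rough sketch of the whole scheme rather than a definitive implementation. It assumes distinct non-negative integer keys below `key_bound`, uses the dot-product family at both levels, and rounds each table size up to a prime (so level 1 has slightly more than $n$ slots and level 2 slightly more than $n_i^2$). The names `build_perfect`, `lookup`, and `total_storage` are hypothetical.

```python
import random

def next_prime(q):
    """Smallest prime >= q, by trial division."""
    q = max(q, 2)
    while any(q % d == 0 for d in range(2, int(q ** 0.5) + 1)):
        q += 1
    return q

def draw_hash(m, key_bound, rng):
    """Draw h_a from the dot-product family over base-m digits of keys < key_bound."""
    digits = 1
    while m ** digits < key_bound:
        digits += 1
    a = [rng.randrange(m) for _ in range(digits)]

    def h(k):
        total = 0
        for ai in a:
            total += ai * (k % m)
            k //= m
        return total % m

    return h

def build_perfect(keys, key_bound, rng=random):
    """Two-level table: level 1 has about n slots, bucket i gets about n_i^2 slots."""
    m1 = next_prime(len(keys))
    h1 = draw_hash(m1, key_bound, rng)
    buckets = [[] for _ in range(m1)]
    for k in keys:
        buckets[h1(k)].append(k)
    level2 = []
    for bucket in buckets:
        m2 = next_prime(len(bucket) ** 2)
        while True:  # redraw until the bucket's keys land in distinct slots
            h2 = draw_hash(m2, key_bound, rng)
            table = [None] * m2
            for k in bucket:
                if table[h2(k)] is not None:
                    break
                table[h2(k)] = k
            else:
                break
        level2.append((h2, table))
    return h1, level2

def lookup(structure, k):
    """Worst-case O(1): one level-1 hash, one level-2 hash, one comparison."""
    h1, level2 = structure
    h2, table = level2[h1(k)]
    return table[h2(k)] == k

def total_storage(structure):
    """Level-1 slots plus all level-2 slots."""
    _, level2 = structure
    return len(level2) + sum(len(table) for _, table in level2)

keys = random.sample(range(10_000), 50)
s = build_perfect(keys, key_bound=10_000)
print(all(lookup(s, k) for k in keys), total_storage(s))
```

This sketch draws the level-1 function only once; an implementation that wants a hard bound on space, not just an expected one, could redraw it whenever the total storage comes out too large.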

0 0