MIT Introduction to Algorithms Record-8

Source: Internet | Published by: 卡洛斯实况巅峰数据 | Editor: 程序博客网 | Time: 2024/04/30 11:22

Weakness of hashing: For any choice of hash function, there always exists a bad set of keys that all hash to the same slot.
Idea: Choose the hash function at random, independently of the keys.


Universal Hashing

Definition1: Let $U$ be a universe of keys, and let $H$ be a finite collection of hash functions, each mapping $U$ to $\{0, 1, \dots, m-1\}$.

$H$ is universal if:

for all $x, y \in U$ with $x \neq y$, $|\{h \in H : h(x) = h(y)\}| = |H|/m$.

I.e. if $h$ is chosen randomly from $H$, the probability of a collision between $x$ and $y$ is $1/m$.

Theorem1:
Choose $h$ randomly from $H$, and suppose we hash $n$ keys into $m$ slots in table $T$. Then for a given key $x$, the expected number of collisions with $x$ is:

$E[\#\text{collisions with } x] < n/m$

Proof:
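Let $C_x$ be the random variable denoting the total number of collisions of keys in $T$ with $x$, and let

$c_{xy} = \begin{cases} 1 & \text{if } h(x) = h(y) \\ 0 & \text{otherwise} \end{cases}$

Note: $E[c_{xy}] = 1/m$ and $C_x = \sum_{y \in T - \{x\}} c_{xy}$, where $y$ ranges over the keys in table $T$ other than $x$.

So:

$E[C_x] = E\left[\sum_{y \in T - \{x\}} c_{xy}\right] = \sum_{y \in T - \{x\}} E[c_{xy}] = \sum_{y \in T - \{x\}} \frac{1}{m} = \frac{n-1}{m} < \frac{n}{m}$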
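As a sanity check on Theorem1, here is a minimal sketch that computes the expectation exactly for a tiny case. It assumes, purely for illustration, that $H$ is the family of all functions from a 4-key universe to 3 slots (a family that is easy to enumerate and satisfies Definition1); the names `average_collisions`, `universe`, and `table_keys` are hypothetical.

```python
from fractions import Fraction
from itertools import product

def average_collisions(universe, table_keys, x, m):
    """Average, over every function h: universe -> {0..m-1}, of the number
    of keys y in table_keys (y != x) with h(y) == h(x)."""
    total = 0
    count = 0
    # Each tuple of slot values defines one function h on the universe.
    for values in product(range(m), repeat=len(universe)):
        h = dict(zip(universe, values))
        total += sum(1 for y in table_keys if y != x and h[y] == h[x])
        count += 1
    return Fraction(total, count)

universe = ["a", "b", "c", "d"]   # assumed toy universe
table_keys = ["a", "b", "c"]      # n = 3 keys stored in T, including x
avg = average_collisions(universe, table_keys, "a", m=3)
print(avg)  # (n - 1) / m for this family
```

With $n = 3$ and $m = 3$ the average comes out to $(n-1)/m = 2/3$, which is below the $n/m$ bound of the theorem.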


Constructing a universal hash function

Let $m$ be prime. Decompose key $k$ into $r+1$ digits, so

$k = \langle k_0, k_1, \dots, k_r \rangle$ where $0 \le k_i \le m-1$.

Here we are treating $k$ as an $(r+1)$-digit base-$m$ number.
Now pick an $a$ at random, which we also view as a base-$m$ number: $a = \langle a_0, a_1, \dots, a_r \rangle$, where each $a_i$ is chosen randomly from $\{0, 1, \dots, m-1\}$. The hash function is defined as follows:

Definition2:

$h_a(k) = \left(\sum_{i=0}^{r} a_i k_i\right) \bmod m$
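The definition can be sketched in a few lines of Python. This is only an illustration under stated assumptions: keys are non-negative integers smaller than $m^{r+1}$, $m$ is assumed prime (the code does not check this), and the helper names `to_digits` and `make_hash` are hypothetical.

```python
import random

def to_digits(k, m, r):
    """Split integer key k into r+1 base-m digits, least significant first."""
    digits = []
    for _ in range(r + 1):
        digits.append(k % m)
        k //= m
    if k != 0:
        raise ValueError("key does not fit in r+1 base-m digits")
    return digits

def make_hash(m, r, rng=random):
    """Pick a = <a_0, ..., a_r> uniformly at random and return (a, h_a)."""
    a = [rng.randrange(m) for _ in range(r + 1)]

    def h(k):
        # Dot product of the digit vectors, reduced modulo m.
        return sum(ai * ki for ai, ki in zip(a, to_digits(k, m, r))) % m

    return a, h

# Usage: m = 7 slots, keys with r + 1 = 3 base-7 digits (0 <= k < 343).
a, h = make_hash(7, 2)
print(a, h(100))
```

Each call to `make_hash` draws a fresh function from the family, which is the "choose the hash function at random" step.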

How big is this set of hash functions, i.e. how many different hash functions does $H$ contain?
Conclusion:

$|H| = m^{r+1}$

Explanation:
There are $m$ choices for each $a_i$ ($0 \le a_i \le m-1$), and the vector $a = \langle a_0, a_1, \dots, a_r \rangle$ has $r+1$ elements.

Theorem2: H is universal.

Proof:
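Let $x = \langle x_0, x_1, \dots, x_r \rangle$ and $y = \langle y_0, y_1, \dots, y_r \rangle$ be distinct keys, so they differ in at least one digit. They could differ in any one of these digits; without loss of generality, assume they differ in position 0.

Question: For how many hash functions $h_a \in H$ do $x$ and $y$ collide?

If they collide, we must have $h_a(x) = h_a(y)$, that is:

$\left(\sum_{i=0}^{r} a_i x_i\right) \bmod m = \left(\sum_{i=0}^{r} a_i y_i\right) \bmod m$

$\sum_{i=0}^{r} a_i x_i \equiv \sum_{i=0}^{r} a_i y_i \pmod{m}$ (the reduction modulo $m$ is applied after everything else has been computed)

$\sum_{i=0}^{r} a_i (x_i - y_i) \equiv 0 \pmod{m}$

$a_0 (x_0 - y_0) + \sum_{i=1}^{r} a_i (x_i - y_i) \equiv 0 \pmod{m}$

$a_0 (x_0 - y_0) \equiv -\sum_{i=1}^{r} a_i (x_i - y_i) \pmod{m}$

Since $x_0 \neq y_0$, the difference $(x_0 - y_0)$ is not congruent to 0 modulo $m$, so by the Number Theory Fact it has an inverse $(x_0 - y_0)^{-1}$, and we can solve for $a_0$:

$a_0 \equiv \left(-\sum_{i=1}^{r} a_i (x_i - y_i)\right) (x_0 - y_0)^{-1} \pmod{m}$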

Number Theory Fact:

Let $m$ be prime. For any $z \in Z_m$ ($Z_m$ is the set of integers modulo $m$) such that $z$ is not congruent to 0, there exists a unique inverse $z^{-1} \in Z_m$ such that multiplying $z$ by its inverse gives something congruent to 1 mod $m$.
I.e.

$z \not\equiv 0 \Rightarrow \exists \text{ unique } z^{-1} \in Z_m \text{ such that } z \cdot z^{-1} \equiv 1 \pmod{m}$.
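A quick way to see this fact in action is to check it by brute force for a small prime. The sketch below computes the inverse as $z^{m-2} \bmod m$, which relies on Fermat's little theorem (a standard result that the notes do not mention); the function name `mod_inverse` is hypothetical.

```python
def mod_inverse(z, m):
    """Inverse of z modulo a prime m, computed as z^(m-2) mod m.

    Assumes m is prime (Fermat's little theorem); this is not checked.
    """
    if z % m == 0:
        raise ValueError("0 has no inverse modulo m")
    return pow(z, m - 2, m)

m = 7
for z in range(1, m):
    # Brute-force search: every w in Z_m with z * w congruent to 1.
    inverses = [w for w in range(m) if (z * w) % m == 1]
    # Exactly one such w exists, and it matches mod_inverse.
    assert inverses == [mod_inverse(z, m)]
    print(z, inverses[0])
```

For every nonzero $z$ the search finds exactly one inverse, which is the uniqueness the proof of Theorem2 needs.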

Conclusion:
Thus, for any choice of $a_1, a_2, \dots, a_r$, exactly 1 of the $m$ choices for $a_0$ causes $x$ and $y$ to collide, and the other $m-1$ choices for $a_0$ cause no collision.
So the number of $h_a$'s that cause $x$ and $y$ to collide is:

$m \cdot m \cdots m \cdot 1 = m^r = |H|/m$

because $a_1$ has $m$ choices, $a_2$ has $m$ choices, and so on, but $a_0$ has only 1 choice if $x$ and $y$ are to collide.
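The count can be confirmed by enumerating the whole family for small parameters. This is a minimal sketch, assuming keys are already given as tuples of $r+1$ base-$m$ digits and $m$ is prime; the names `h` and `count_colliding` are hypothetical.

```python
from itertools import product

def h(a, k, m):
    """h_a(k): dot product of digit vectors a and k, modulo m."""
    return sum(ai * ki for ai, ki in zip(a, k)) % m

def count_colliding(x, y, m):
    """Number of vectors a in {0..m-1}^(r+1) with h_a(x) == h_a(y)."""
    return sum(
        1
        for a in product(range(m), repeat=len(x))
        if h(a, x, m) == h(a, y, m)
    )

m = 5
x = (1, 2, 3)   # r + 1 = 3 digits
y = (4, 2, 0)   # differs from x in positions 0 and 2
count = count_colliding(x, y, m)
print(count, m ** 2)  # both should be m^r = |H| / m
```

Here $|H| = 5^3 = 125$, and the number of colliding functions is $5^2 = 25 = |H|/m$, as the theorem states.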


Perfect Hashing

Situation: Given $n$ keys, construct a static hash table of size $m = O(n)$ such that search takes $O(1)$ time in the worst case.

Idea: Use a 2-level scheme with universal hashing at both levels. We arrange things so that there are no collisions at level 2, while tolerating any collisions that occur at level 1.
If $n_i$ items hash to level-1 slot $i$, then use $m_i = n_i^2$ slots in the level-2 table $S_i$.
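The two-level scheme can be sketched as follows. Note the assumptions: the level-2 table sizes $n_i^2$ are generally not prime, so instead of the digit-vector family above this sketch swaps in the Carter-Wegman family $h(k) = ((ak + b) \bmod p) \bmod m$, which is universal for any table size; keys are assumed to be integers in $[0, p)$; the level-1 size is taken to be $n$; and level-2 functions are simply redrawn until one is collision-free. All names (`P`, `random_hash`, `build_perfect_table`, `contains`) are hypothetical.

```python
import random

P = 2_147_483_647  # the prime 2^31 - 1; assumes every key is an integer in [0, P)

def random_hash(m, rng):
    """Draw h(k) = ((a*k + b) mod P) mod m at random (Carter-Wegman family)."""
    a = rng.randrange(1, P)
    b = rng.randrange(P)
    return lambda k: ((a * k + b) % P) % m

def build_perfect_table(keys, rng=None):
    """Build a static two-level table: level 1 may collide, level 2 may not."""
    rng = rng or random.Random()
    keys = list(set(keys))           # duplicates would never separate
    m = max(len(keys), 1)            # level-1 size m = n
    h1 = random_hash(m, rng)
    buckets = [[] for _ in range(m)]
    for k in keys:
        buckets[h1(k)].append(k)

    level2 = []
    for bucket in buckets:
        size = len(bucket) ** 2      # m_i = n_i^2 slots for this bucket
        while True:                  # redraw until no level-2 collision
            h2 = random_hash(size, rng) if size else None
            slots = [None] * size
            ok = True
            for k in bucket:
                j = h2(k)
                if slots[j] is not None:
                    ok = False
                    break
                slots[j] = k
            if ok:
                break
        level2.append((h2, slots))
    return h1, level2

def contains(table, key):
    """Worst-case O(1) lookup: one level-1 hash, one level-2 hash, one compare."""
    h1, level2 = table
    h2, slots = level2[h1(key)]
    return bool(slots) and slots[h2(key)] == key

table = build_perfect_table([3, 17, 42, 1000, 99999])
print(contains(table, 42), contains(table, 43))
```

A lookup never scans a chain: it evaluates two hash functions and compares against a single slot, which is where the worst-case $O(1)$ search time comes from.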


Level-2 Analysis

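Theorem: If we hash $n$ keys into $m = n^2$ slots using a random $h$ from a universal family $H$, then

$E[\#\text{collisions}] < \frac{1}{2}$

Proof: The probability that 2 given keys collide under $h$ is $\frac{1}{m} = \frac{1}{n^2}$. There are $\binom{n}{2}$ pairs of keys, so

$E[\#\text{collisions}] = \binom{n}{2} \cdot \frac{1}{n^2}$

$= \frac{n(n-1)}{2} \cdot \frac{1}{n^2} < \frac{1}{2}$

Note: $\binom{n}{2} = C_n^2$ is the number of ways to choose 2 keys out of $n$.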
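The arithmetic in this proof is easy to check exactly. The following sketch evaluates $\binom{n}{2} \cdot \frac{1}{n^2}$ with rational arithmetic; the function name `expected_collisions` is hypothetical.

```python
from fractions import Fraction
from math import comb

def expected_collisions(n):
    """E[#collisions] when n keys are hashed universally into m = n^2 slots."""
    m = n * n
    return Fraction(comb(n, 2), m)   # number of pairs times 1/m

for n in (2, 4, 10, 100):
    e = expected_collisions(n)
    print(n, e, e < Fraction(1, 2))
```

The value simplifies to $\frac{n-1}{2n}$, which approaches $\frac{1}{2}$ from below as $n$ grows but never reaches it.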

Markov inequality

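For a random variable $X \ge 0$, $\Pr\{X \ge t\} \le \frac{E[X]}{t}$.

Proof:

$E[X] = \sum_{x=0}^{\infty} x \Pr\{X = x\} \ge \sum_{x=t}^{\infty} x \Pr\{X = x\}$

$\ge \sum_{x=t}^{\infty} t \Pr\{X = x\} = t \Pr\{X \ge t\}$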
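To make the inequality concrete, the sketch below checks it on a small discrete distribution. The fair six-sided die is an assumed example, and the name `markov_holds` is hypothetical.

```python
from fractions import Fraction

def markov_holds(dist, t):
    """Check Pr{X >= t} <= E[X] / t for a distribution over values >= 0.

    dist maps each value of X to its probability.
    """
    expectation = sum(x * p for x, p in dist.items())
    tail = sum(p for x, p in dist.items() if x >= t)
    return tail <= expectation / t

# Assumed example: a fair six-sided die, E[X] = 7/2.
die = {x: Fraction(1, 6) for x in range(1, 7)}
for t in range(1, 7):
    print(t, markov_holds(die, t))
```

The bound is loose for the die, but it can be tight: for $X$ equal to 0 or 2 with probability $\frac{1}{2}$ each, $\Pr\{X \ge 2\} = \frac{1}{2} = E[X]/2$.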

Corollary:

$\Pr\{\text{no collisions}\} \ge \frac{1}{2}$

Now we can use the Markov inequality to prove the corollary.

Proof:

$\Pr\{\ge 1 \text{ collision}\} \le \frac{E[\#\text{collisions}]}{1} < \frac{1}{2}$

$\Pr\{\text{no collisions}\} = 1 - \Pr\{\ge 1 \text{ collision}\} \ge \frac{1}{2}$

Conclusion: To find a good level-2 hash function, just test a few at random; we'll find one quickly, since at least half of them will work.

Analysis of storage

For level 1, choose $m = n$, and let $n_i$ be the random variable for the number of keys that hash to slot $i$ in table $T$. Use $m_i = n_i^2$ slots in each level-2 table $S_i$, so

$E[\text{total storage}] = n + E\left[\sum_{i=0}^{m-1} \Theta(n_i^2)\right] = \Theta(n)$

Note: $E\left[\sum_{i=0}^{m-1} \Theta(n_i^2)\right] = \Theta(n)$ follows from the same analysis as for bucket sort.
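The $\Theta(n)$ bound can also be derived directly from universality by counting colliding pairs at level 1. The following is a short derivation sketch, not taken from the lecture notes; it uses only the collision probability $1/m$ from Definition1 with $m = n$.

```latex
% Each bucket of size n_i contributes n_i keys and \binom{n_i}{2} colliding pairs:
\sum_{i=0}^{m-1} n_i^2
  = \sum_{i=0}^{m-1} \left( n_i + 2\binom{n_i}{2} \right)
  = n + 2 \cdot \#\{\text{colliding pairs at level 1}\}

% Each of the \binom{n}{2} pairs collides with probability 1/m, and m = n:
E\left[ \sum_{i=0}^{m-1} n_i^2 \right]
  = n + 2 \binom{n}{2} \cdot \frac{1}{m}
  = n + (n - 1)
  < 2n
```

So the expected total size of the level-2 tables is less than $2n$, and adding the $n$ level-1 slots keeps the expected total storage at $\Theta(n)$.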
