hash

Source: Internet; Editor: 程序博客网; Time: 2024/05/04 21:39

Selecting a Hash Function

A good hash function approximates uniform hashing: it spreads elements evenly and randomly throughout the hash table.

h (k) = x

where x is called the hash coding of k.

Most hashing methods assume that k is an integer; when k is not an integer, it can be converted to one.
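As a rough illustration of converting a non-integer key, the sketch below interprets the UTF-8 bytes of a string as one big-endian integer. The helper name `key_to_int` is hypothetical, and this is only one of many possible conversions, not a method prescribed by the text.

```python
def key_to_int(key: str) -> int:
    """Interpret the UTF-8 bytes of a string as a single big-endian integer.

    Illustrative conversion only: any deterministic mapping from keys
    to integers would serve the same purpose.
    """
    return int.from_bytes(key.encode("utf-8"), "big")


if __name__ == "__main__":
    # "a" is the single byte 97; "ab" is the two bytes 97, 98.
    print(key_to_int("a"))
    print(key_to_int("ab"))
```

The resulting integer can then be passed to any of the integer hash functions described in the following sections.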

### Division method
$h(k) = k \bmod m$
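Avoid choosing m to be a power of 2: if $m = 2^p$, the hash function simply returns the p lowest-order bits of k. Typically, m is chosen to be a prime number that is not too close to a power of 2.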

For example, if we expect to insert around n = 4500 elements into a chained hash
table, we might choose m = 1699, a good prime number between $2^{10}$ and $2^{11}$.
This results in a load factor of α = 4500/1699 ≈ 2.6, which indicates that generally
two or three elements will reside in each bucket, assuming uniform hashing.
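
A minimal sketch of the division method follows. It shows why a power-of-2 table size is a poor choice (keys that share their low bits all collide) and recomputes the load factor from the example above. The function name `division_hash` and the sample keys are assumptions chosen for illustration.

```python
def division_hash(k: int, m: int) -> int:
    """Division method: h(k) = k mod m."""
    return k % m


# Three sample keys that differ only in their high bits (low 4 bits are 0110).
keys = [0b1010_0110, 0b1111_0110, 0b0001_0110]

# With m = 16 = 2**4 only the 4 lowest bits matter, so every key collides.
power_of_two_codes = [division_hash(k, 16) for k in keys]

# With the prime m = 17 the high bits also influence the result.
prime_codes = [division_hash(k, 17) for k in keys]

# Load factor from the example: n = 4500 elements, m = 1699 buckets.
load_factor = 4500 / 1699

if __name__ == "__main__":
    print(power_of_two_codes)
    print(prime_codes)
    print(round(load_factor, 1))
```
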
### Multiplication method
$h(k) = \lfloor m (kA \bmod 1) \rfloor, \quad A \approx \frac{\sqrt{5} - 1}{2} \approx 0.618$
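That is, take the fractional part of $kA$ (which is $kA \bmod 1$), multiply it by $m$, and round down.

Here $\lfloor x \rfloor$ denotes the largest integer not exceeding $x$, and $m$ is the number of buckets in the hash table.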
For example, if the table contains m = 2000 positions and we hash the key k = 6341, the hash coding is

$\lfloor (2000)((6341)(0.618) \bmod 1) \rfloor = \lfloor (2000)(3918.738 \bmod 1) \rfloor = \lfloor (2000)(0.738) \rfloor = 1476$.
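
The worked example can be checked with a short sketch. It assumes A is taken as exactly 0.618 and uses `fractions.Fraction` so that the arithmetic matches the hand calculation; with a binary floating-point A, rounding error could push the product just below an integer and change the floor. The name `multiplication_hash` is hypothetical.

```python
import math
from fractions import Fraction

# Assumption: A is exactly 0.618, as in the worked example.
A = Fraction("0.618")


def multiplication_hash(k: int, m: int, a: Fraction = A) -> int:
    """Multiplication method: h(k) = floor(m * (k*a mod 1))."""
    fractional_part = (k * a) % 1      # keep only the fractional part of k*a
    return math.floor(m * fractional_part)


if __name__ == "__main__":
    print(multiplication_hash(6341, 2000))
```

Unlike the division method, the result here does not depend on m being prime, since m only scales the fractional part into the range of table positions.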

Type of Hash Table

Chained hash table

Open-addressed hash table

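In an open-addressed hash table, the number of elements stored cannot exceed the number of positions in the table (n ≤ m), so its load factor is always less than or equal to 1.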

concept

load factor

$\alpha = \frac{n}{m}$
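where n is the number of elements to be stored and m is the number of positions (distinct hash codings) into which elements may be hashed.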

probe

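Collisions in an open-addressed hash table are resolved by probing the table. To insert an element, we probe positions until we find an empty one and place the element there. To remove or look up an element, we likewise probe until we either find the element or reach an empty position. If we reach an empty position before finding the element, or have traversed every position, the element is not in the table.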
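
The insert and lookup procedures described above can be sketched as follows. The probe function is pluggable; the default `linear_probe` (step to the next position on each collision) is an assumption made only to keep the example short and is not a method named in the text. Removal is omitted because it needs extra bookkeeping for vacated slots.

```python
from typing import Callable, List, Optional

ProbeFn = Callable[[int, int, int], int]   # (key, probe number i, table size m) -> position


def linear_probe(k: int, i: int, m: int) -> int:
    """Assumed probe function for illustration: h(k, i) = (k + i) mod m."""
    return (k + i) % m


def insert(table: List[Optional[int]], k: int, probe: ProbeFn = linear_probe) -> int:
    """Probe until an empty position is found, store k there, return the position."""
    m = len(table)
    for i in range(m):
        pos = probe(k, i, m)
        if table[pos] is None:
            table[pos] = k
            return pos
    raise OverflowError("hash table is full")


def lookup(table: List[Optional[int]], k: int, probe: ProbeFn = linear_probe) -> Optional[int]:
    """Probe until k or an empty position is found; None means k is absent."""
    m = len(table)
    for i in range(m):
        pos = probe(k, i, m)
        if table[pos] is None:
            return None            # empty slot reached before finding k
        if table[pos] == k:
            return pos
    return None                    # every position traversed without a match


if __name__ == "__main__":
    table: List[Optional[int]] = [None] * 7
    print(insert(table, 10), insert(table, 17))
    print(lookup(table, 17), lookup(table, 24))
```
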

goal

The main goal is to minimize the number of probes. Exactly how many probes are required depends primarily on two factors:
- the load factor
- the degree to which elements are distributed uniformly

Assuming uniform hashing, the number of positions we can expect to probe is

$\frac{1}{1 - \alpha}$

| Load Factor (%) | Expected Probes |
|---|---|
| < 50 | < 1 / (1 - 0.5) = 2 |
| 80 | 1 / (1 - 0.8) = 5 |
| 90 | 1 / (1 - 0.9) = 10 |
| 95 | 1 / (1 - 0.95) = 20 |
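
The figures in the table follow directly from the formula, as the short sketch below shows. The function name `expected_probes` is hypothetical, and the results are rounded for display because binary floating-point arithmetic does not represent values such as 0.95 exactly.

```python
def expected_probes(alpha: float) -> float:
    """Expected number of probes, 1 / (1 - alpha), assuming uniform hashing."""
    if not 0 <= alpha < 1:
        raise ValueError("load factor must satisfy 0 <= alpha < 1")
    return 1 / (1 - alpha)


if __name__ == "__main__":
    for percent in (50, 80, 90, 95):
        probes = expected_probes(percent / 100)
        print(f"{percent}% -> {probes:.0f} probes")
```
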

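In practice, how closely actual performance approaches the figures in the table depends on how closely we approximate uniform hashing, which in turn depends on the hash function we choose. In an open-addressed hash table, however, it also depends on how we probe subsequent positions in the table when a collision occurs.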

Generally, a hash function for probing positions in an open-addressed hash table is defined by:

$h (k, i) = x$
where k is a key, i is the number of times the table has been probed thus far, and x is the resulting hash coding.

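One of the most effective approaches to probing an open-addressed hash table is double hashing, which adds together the results of two auxiliary hash functions.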

$h(k, i) = (h_1(k) + i\,h_2(k)) \bmod m$
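The functions $h_1(k)$ and $h_2(k)$ are auxiliary hash functions. They are chosen in the same way as any other hash function, so that elements are distributed as uniformly and randomly as possible. However, to ensure that every position is visited before any position is visited twice, one of the following procedures must be followed:
- Choose m to be a power of 2 and make $h_2$ always return an odd value.
- Choose m to be prime and design $h_2$ so that it always returns a positive integer less than m.

Typically, $h_1(k) = k \bmod m$ and $h_2(k) = 1 + (k \bmod m')$, where $m'$ is slightly less than m, such as $m - 1$ or $m - 2$.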
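
A small sketch can confirm why these two procedures work: the probe sequence visits all m positions exactly when the step $h_2(k)$ shares no common factor with m. The helper names below are hypothetical, and the starting position and step are passed in directly rather than computed from a key.

```python
from typing import List


def probe_sequence(start: int, step: int, m: int) -> List[int]:
    """Positions (start + i*step) mod m for i = 0 .. m-1."""
    return [(start + i * step) % m for i in range(m)]


def visits_every_position(start: int, step: int, m: int) -> bool:
    """True if the first m probes are all distinct, i.e. cover the whole table."""
    return len(set(probe_sequence(start, step, m))) == m


if __name__ == "__main__":
    # m a power of 2: an odd step covers the table, an even step does not.
    print(visits_every_position(5, 3, 16), visits_every_position(5, 4, 16))
    # m prime: every step in 1 .. m-1 covers the table.
    print(all(visits_every_position(0, step, 17) for step in range(1, 17)))
```
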

For example, if the hash table contains m = 1699 positions (a prime number) and we hash the key k = 15,385 with m′ = m − 2 = 1697, then h1(k) = 15,385 mod 1699 = 94 and h2(k) = 1 + (15,385 mod 1697) = 113. The positions probed are therefore (94 + (0)(113)) mod 1699 = 94 when i = 0, and every 113th position after this as i increases.
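
The example can be reproduced with the sketch below. It assumes m′ = m − 2 = 1697, which is the choice that yields the step of 113 quoted above; the constant and function names are hypothetical.

```python
M = 1699          # table size (prime), from the example
M_PRIME = 1697    # assumed m' = m - 2; gives the step of 113 for k = 15385


def h1(k: int) -> int:
    """First auxiliary hash: the starting position."""
    return k % M


def h2(k: int) -> int:
    """Second auxiliary hash: the step, always in 1 .. M_PRIME."""
    return 1 + (k % M_PRIME)


def h(k: int, i: int) -> int:
    """Double hashing: position of the i-th probe for key k."""
    return (h1(k) + i * h2(k)) % M


if __name__ == "__main__":
    k = 15385
    print(h1(k), h2(k))
    print([h(k, i) for i in range(4)])
```
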

The advantage of double hashing is that it is one of the best forms of probing,
producing a good distribution of elements throughout a hash table.
The disadvantage is that m is constrained in order to ensure that all positions in the table will be visited in a series of probes before any position is probed twice.

0 0