【Data Structures】 10. Hashing—Mission Possible

Source: Internet. Editor: 程序博客网. Time: 2024/06/14 12:52

Hashing converts words (keys) to numbers, more specifically integers, which can then be used as array indices.
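As a rough illustration of turning a word into an integer index, here is a minimal sketch. The function name `hash_word` and the multiplier 31 are assumptions chosen for the example, not something the text prescribes.

```python
def hash_word(word: str, table_size: int) -> int:
    """Map a word to an index in the range [0, table_size)."""
    h = 0
    for ch in word:
        # Treat the word as a base-31 number (31 is an assumed multiplier)
        # and reduce modulo the table size at every step.
        h = (h * 31 + ord(ch)) % table_size
    return h


index = hash_word("cat", 11)  # some index between 0 and 10
```

Two different words can still land on the same index, which is the collision problem that the workarounds in this post address.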

Workaround 1: Open Addressing (mainly linear probing)

Step size:

In linear probing, the step size is always 1, which means the probe goes to x, x+1, x+2, x+3, and so on.
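A minimal sketch of insertion with linear probing, assuming integer keys, `None` as the empty-cell marker, and wrap-around at the end of the array. The names `probe_sequence` and `insert` are hypothetical, and duplicate keys are not handled.

```python
def probe_sequence(start: int, table_size: int):
    """Yield the cells visited by linear probing: start, start+1, ... (wrapping)."""
    for i in range(table_size):
        yield (start + i) % table_size


def insert(table: list, key: int) -> int:
    """Put key into the first empty cell on its probe path; return that index."""
    start = key % len(table)  # assumed hash function: key mod table length
    for idx in probe_sequence(start, len(table)):
        if table[idx] is None:
            table[idx] = key
            return idx
    raise RuntimeError("hash table is full")


table = [None] * 7
insert(table, 10)  # 10 % 7 == 3, cell 3 is free
insert(table, 17)  # 17 % 7 == 3, collision, so the key moves on to cell 4
insert(table, 24)  # 24 % 7 == 3, cells 3 and 4 are taken, so cell 5
```

The three keys end up in adjacent cells 3, 4 and 5, which is a small cluster already forming.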

Clustering:

A cluster is a long sequence of consecutive filled cells in a hash table.

As a hash table becomes more and more full, the clusters grow larger and larger.

- When the hash table is half full, the performance is still not bad.

- However, analysis of linear probing shows that once the table is more than about two-thirds full, performance degrades seriously.

It is critical to ensure that a hash table never becomes full (ideally, never more than two-thirds full).

Load factor

It is the ratio of the number of data items in a hash table to the length of the array.

In linear probing, search time becomes really slow as the load factor approaches 1.
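To put rough numbers on this, the sketch below uses Knuth's classical approximations for the expected number of probes in linear probing. They assume a hash function that spreads keys uniformly, and the function names are hypothetical.

```python
def load_factor(n_items: int, array_length: int) -> float:
    """Ratio of stored items to array length."""
    return n_items / array_length


def probes_successful(alpha: float) -> float:
    """Approximate expected probes to find an existing key."""
    return 0.5 * (1 + 1 / (1 - alpha))


def probes_unsuccessful(alpha: float) -> float:
    """Approximate expected probes for a missing key (or an insert)."""
    return 0.5 * (1 + 1 / (1 - alpha) ** 2)


for alpha in (0.5, 2 / 3, 0.9):
    print(alpha, probes_successful(alpha), probes_unsuccessful(alpha))
```

Under these approximations a half-full table needs about 1.5 probes for a hit and 2.5 for a miss, a two-thirds-full table about 2 and 5, and a table at load factor 0.9 about 5.5 and 50.5.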

How to solve it?

Rehashing

First, it is necessary to create a new array, typically about twice the length of the old one; when this happens depends on the load factor threshold you provided.

The hash method then calculates the location of a given data item based on the new array length.

Second, we need to go through the old array, cell by cell, and insert each item into the new array by calling the hash function again for every item.

It's a time-consuming process.
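The two rehashing steps can be sketched as follows, assuming integer keys, linear probing, `None` as the empty-cell marker, and a new array exactly twice as long. The names `insert` and `rehash` are hypothetical.

```python
def insert(table: list, key: int) -> None:
    """Linear-probing insert; assumes the table has at least one empty cell."""
    idx = key % len(table)  # location depends on the current array length
    while table[idx] is not None:
        idx = (idx + 1) % len(table)
    table[idx] = key


def rehash(old_table: list) -> list:
    """Build a table twice as long and re-insert every stored key."""
    new_table = [None] * (2 * len(old_table))
    for key in old_table:  # walk the old array cell by cell
        if key is not None:
            insert(new_table, key)  # the hash is recomputed for the new length
    return new_table


old = [None] * 4
for k in (1, 5, 9):
    insert(old, k)  # all three hash to cell 1 and cluster in cells 1, 2, 3
new = rehash(old)   # in 8 cells, key 5 now lands in cell 5, away from the others
```

Every stored item is visited and hashed again, so the cost grows linearly with the number of items, which is why rehashing is expensive.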

In Open Addressing, there are two other major collision resolution mechanisms: Quadratic Probing and Double Hashing.

Workaround 2: Separate Chaining

In open addressing, collisions are resolved by looking for an open cell in the hash table.

Another approach is to put a linked list at each index in the hash table.

In separate chaining, it is normal to put n or more items in an array of length n.

Finding the initial cell takes O(1), whereas searching through a linked list takes O(k) when there are k elements in the list.

Thus, we do not want the linked lists to become too long either, especially if your hash function distributes keys poorly.

However, the load factor in separate chaining can rise above 1 without hurting performance too much, assuming the hashCode method is good.

When in doubt, consider using separate chaining, especially if the number of items that will be inserted into the hash table is unknown. In other words, separate chaining is the better choice when you expect a high load factor.
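A minimal sketch of separate chaining. As an assumption for brevity, each chain is a Python list of (key, value) pairs standing in for a linked list, and Python's built-in `hash` plays the role of the hash function. The class name `ChainedHashTable` is hypothetical.

```python
class ChainedHashTable:
    def __init__(self, size: int = 4):
        # One chain per array index.
        self.buckets = [[] for _ in range(size)]
        self.count = 0

    def _index(self, key) -> int:
        return hash(key) % len(self.buckets)

    def put(self, key, value) -> None:
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:  # key already stored: overwrite its value
                bucket[i] = (key, value)
                return
        bucket.append((key, value))
        self.count += 1

    def get(self, key, default=None):
        # O(1) to find the chain, O(k) to scan its k entries.
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        return default

    def load_factor(self) -> float:
        return self.count / len(self.buckets)


table = ChainedHashTable(4)
for i in range(10):
    table.put(i, i * i)  # 10 items in 4 chains, so the load factor is 2.5
```

The table keeps working at a load factor of 2.5; lookups simply scan chains of two or three entries each.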

To summarize, there are a few ways to deal with collisions.

Linear Probing:

When there is a collision, we search sequentially for an empty cell and put the value into the nearest one. However, this approach tends to form primary clusters, and performance can get really bad. It is necessary to rehash from time to time to keep the load factor low.

There is also Quadratic Probing, which avoids primary clustering but has a subtler issue called secondary clustering: keys that hash to the same initial cell all follow the same fixed probe sequence.
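A short sketch of a quadratic probe sequence, assuming integer keys and the common form "initial cell plus i squared". The function name `quadratic_probes` is hypothetical.

```python
def quadratic_probes(key: int, table_size: int, n_probes: int) -> list:
    """Return the first n_probes cells visited: start + 0, 1, 4, 9, ... (wrapping)."""
    start = key % table_size
    return [(start + i * i) % table_size for i in range(n_probes)]


a = quadratic_probes(3, 11, 4)   # starts at cell 3
b = quadratic_probes(14, 11, 4)  # 14 % 11 == 3, so it also starts at cell 3
```

Both keys visit exactly the same cells in the same order, which is secondary clustering in miniature.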

To solve this clustering issue, there is another workaround called Double Hashing, which uses two hash functions: one calculates the hash value, and the other decides the step size of probing.
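A sketch of double hashing with integer keys. The second hash function used here, `1 + key % (table_size - 1)`, is one common textbook choice and an assumption of this example. It never yields a step of 0, and a prime table size lets every step size reach every cell.

```python
def double_hash_probes(key: int, table_size: int, n_probes: int) -> list:
    """Return the first n_probes cells visited under double hashing."""
    start = key % table_size            # first hash: the initial cell
    step = 1 + key % (table_size - 1)   # second hash: the step size, never 0
    return [(start + i * step) % table_size for i in range(n_probes)]


a = double_hash_probes(3, 11, 4)   # start 3, step 4
b = double_hash_probes(14, 11, 4)  # start 3, step 5
```

Both keys collide at cell 3, but because their step sizes differ, their probe paths separate right after the first cell.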

Separate Chaining: The other workaround is to have a linked list at each index. This lets us worry much less about the load factor. However, we do not want the linked lists to become too long either.

0 0