Using External Storage to Process Big Data (1)

Source: Internet | Published by: 虚拟仿真软件 | Editor: 程序博客网 | Time: 2024/06/05 21:17

Problem:

We said that big data is data that cannot fit in main memory (often called RAM, for Random Access Memory) all at once. How would you handle this situation?

Solution:

We can use a divide-and-conquer algorithm: divide the big problem into small problems, solve every small problem with the same method, and finally merge the results. In this case a different kind of storage is necessary. Disk files generally have a much larger capacity than main memory, but external storage is also much slower than main memory. This speed difference means that different techniques must be used to handle it efficiently.
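As a minimal sketch of the divide, solve, and merge idea, the snippet below finds the largest number in a file that is assumed to hold one integer per line. The function name `chunked_max` and the `lines_per_chunk` parameter are made up for illustration; only one chunk is held in memory at a time.

```python
from itertools import islice


def chunked_max(path, lines_per_chunk=1000):
    """Return the largest integer in the file, or None if the file is empty.

    Assumes one integer per line (an assumption of this sketch).
    """
    partial_results = []
    with open(path, encoding="utf-8") as f:
        while True:
            # Divide: take the next small piece that fits in memory.
            chunk = [int(line) for line in islice(f, lines_per_chunk)]
            if not chunk:
                break
            # Solve the small problem entirely in main memory.
            partial_results.append(max(chunk))
    # Merge: combine the per-chunk answers into the final answer.
    return max(partial_results) if partial_results else None
```

The same pattern works for any operation whose partial results can be combined, such as sums or counts.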

Here we suppose our big data (many records) is stored in a file. Data is stored on the disk in chunks called blocks, pages, or allocation units, and the disk drive always reads or writes a minimum of one block at a time; the block size specifies the number of bytes per block. Here a block can be as large as the biggest chunk your main memory can hold. We can divide the file into blocks and then read only the block we want into main memory. The problem is how to find that block quickly.
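Reading a single block by its number can be sketched as follows. The 4096-byte `BLOCK_SIZE` and the helper names `read_block` and `count_blocks` are assumptions for illustration, not values taken from any particular system.

```python
import os

BLOCK_SIZE = 4096  # assumed block size in bytes


def count_blocks(path, block_size=BLOCK_SIZE):
    """Number of blocks needed to cover the file (the last one may be partial)."""
    size = os.path.getsize(path)
    return (size + block_size - 1) // block_size


def read_block(path, block_number, block_size=BLOCK_SIZE):
    """Return the bytes of one block without reading the rest of the file."""
    with open(path, "rb") as f:
        f.seek(block_number * block_size)  # jump straight to the block
        return f.read(block_size)          # empty bytes if past the end
```

Once the block number is known, fetching the block is a single seek and read; the hard part, discussed next, is finding the right block number.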

Problem: How can you find the block quickly?

Solution:

We must keep in mind that the time to access a block is much larger than the time for any internal processing of data in main memory, so the overriding consideration in devising an external storage strategy is minimizing the number of block accesses. The usual techniques for this problem are hashing, indexing, and B-trees.

1 Hashing and External Storage

The central feature in external hashing is a hash table containing block numbers, which refer to blocks in external storage. The hash table is sometimes called an index (in the sense of a book's index). It can be stored in main memory or, if it is too large, stored externally on disk, with only part of it being read into main memory at a time.

1) First, all records with keys that hash to the same value are located in the same block.

2) Second, to find a record with a particular key, the search algorithm hashes the key, uses the hash value as an index into the hash table, gets the block number at that index, and reads the block.

To implement this scheme, we must choose the hash function and the size of the hash table with some care so that a limited number of keys hash to the same value.
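The scheme above can be sketched with an in-memory table of block numbers and a single data file. Everything specific here is an assumption of the sketch: the fixed record layout (a 4-byte non-negative integer key plus a 16-byte payload), four records per block, the `key % TABLE_SIZE` hash function, and the class name `ExternalHashFile`. Full blocks simply raise an error; overflow handling is left out.

```python
import struct

RECORD = struct.Struct("<i16s")      # assumed layout: int key + 16-byte payload
RECORDS_PER_BLOCK = 4                # assumed block capacity
BLOCK_SIZE = RECORD.size * RECORDS_PER_BLOCK
TABLE_SIZE = 8                       # assumed hash table size
EMPTY = -1                           # key value marking an unused slot


class ExternalHashFile:
    def __init__(self, path):
        self.path = path
        self.table = [None] * TABLE_SIZE  # hash value -> block number, kept in RAM
        self.next_block = 0               # next unused block number in the file
        self.block_reads = 0              # counts disk block reads
        open(path, "wb").close()          # start with an empty data file

    def _hash(self, key):
        return key % TABLE_SIZE

    def _read_block(self, block_no):
        self.block_reads += 1
        with open(self.path, "rb") as f:
            f.seek(block_no * BLOCK_SIZE)
            data = f.read(BLOCK_SIZE)
        return [RECORD.unpack_from(data, i * RECORD.size)
                for i in range(RECORDS_PER_BLOCK)]

    def _write_block(self, block_no, records):
        with open(self.path, "r+b") as f:
            f.seek(block_no * BLOCK_SIZE)
            f.write(b"".join(RECORD.pack(k, v) for k, v in records))

    def insert(self, key, value):
        h = self._hash(key)
        if self.table[h] is None:
            # First key with this hash value: allocate a fresh block.
            self.table[h] = self.next_block
            self.next_block += 1
            records = [(EMPTY, b"")] * RECORDS_PER_BLOCK
        else:
            records = self._read_block(self.table[h])
        for i, (k, _) in enumerate(records):
            if k == EMPTY or k == key:
                records[i] = (key, value.encode())
                self._write_block(self.table[h], records)
                return
        raise OverflowError("block full; overflow handling is not part of this sketch")

    def find(self, key):
        block_no = self.table[self._hash(key)]  # table lookup, no disk access
        if block_no is None:
            return None
        for k, v in self._read_block(block_no):  # exactly one block read
            if k == key:
                return v.rstrip(b"\0").decode()
        return None
```

In this sketch a lookup costs at most one block read, because the table lookup happens in memory and all records with the same hash value share one block.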

For example:

We can put all the blocks in a directory, with one file per block, and use the hash values as the block file names. You can then find the block file from its name. For instance, if your search key's hash value is 2, you find the file 2.txt and read it into main memory, because all the keys with the same hash value are in the same block.
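A rough sketch of this directory layout is shown below. The names `hash_key`, `put`, and `get`, the choice of ten hash values, the CRC-32 based hash, and the tab-separated `key<TAB>value` line format are all assumptions for illustration. Keys and values are assumed to contain no tabs or newlines.

```python
import os
import zlib

NUM_HASH_VALUES = 10  # assumed hash table size


def hash_key(key):
    """Stable hash of a string key (Python's built-in hash() varies per run)."""
    return zlib.crc32(key.encode("utf-8")) % NUM_HASH_VALUES


def block_path(directory, key):
    """The block file name is just the hash value, e.g. '2.txt'."""
    return os.path.join(directory, f"{hash_key(key)}.txt")


def put(directory, key, value):
    """Append the record to the block file that its key hashes to."""
    with open(block_path(directory, key), "a", encoding="utf-8") as f:
        f.write(f"{key}\t{value}\n")


def get(directory, key):
    """Read only the one block file that can contain the key."""
    path = block_path(directory, key)
    if not os.path.exists(path):
        return None
    with open(path, encoding="utf-8") as f:
        for line in f:
            k, _, v = line.rstrip("\n").partition("\t")
            if k == key:
                return v
    return None
```

Here the file system's own name lookup plays the role of the hash table: the hash value is turned directly into a file name instead of a block number.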

You may be confused by the 11.txt in the above figure. 11.txt is the overflow block file of 1.txt, used when 1.txt is full. This is the separate chaining method of handling full blocks; of course, you can use other methods to find the overflow blocks. In separate chaining, special overflow blocks are made available; when a primary block is found to be full, the new record is inserted in the overflow block.
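Overflow blocks with separate chaining might look like the sketch below. The naming rule is an assumption extrapolated from the 1.txt and 11.txt example: each overflow file repeats the hash digit once more (1.txt, 11.txt, 111.txt), which is only unambiguous while hash values are single digits. The capacity of three records per block, the integer keys, and the function names are also made up for illustration, and deletion is not handled.

```python
import os

TABLE_SIZE = 10     # single-digit hash values keep the file names unambiguous
BLOCK_CAPACITY = 3  # assumed records per block file; small to trigger overflow


def chain(directory, key):
    """Yield the primary block file for the key, then its overflow files."""
    digit = str(key % TABLE_SIZE)
    name = digit
    while True:
        yield os.path.join(directory, name + ".txt")
        name += digit  # assumed naming rule: 1.txt -> 11.txt -> 111.txt


def read_records(path):
    """Return the [key, value] pairs stored in one block file."""
    if not os.path.exists(path):
        return []
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n").split("\t", 1) for line in f]


def insert(directory, key, value):
    """Store the record in the first block of the chain that has room."""
    for path in chain(directory, key):
        if len(read_records(path)) < BLOCK_CAPACITY:
            with open(path, "a", encoding="utf-8") as f:
                f.write(f"{key}\t{value}\n")
            return path


def find(directory, key):
    """Search the primary block, then overflow blocks until the chain ends."""
    for path in chain(directory, key):
        records = read_records(path)
        for k, v in records:
            if int(k) == key:
                return v
        if len(records) < BLOCK_CAPACITY:
            return None  # a block that is not full ends the chain
```

Each overflow block visited costs one more block access, which is why the hash function and table size should be chosen so that chains stay short.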