Using External Storage to Process Big Data (1)
Source: Internet  Editor: 程序博客网  Time: 2024/06/05 21:17
We have said that big data is data that cannot fit in main memory (often called RAM, for Random Access Memory) all at once. How would you handle this situation?
Solution:
We can use a divide-and-conquer algorithm: divide the big problem into small problems, solve each small problem with the same method, and finally merge the results. Because the data does not fit in main memory, a different kind of storage is necessary. Disk files generally have a much larger capacity than main memory, but external storage is much slower than main memory. This speed difference means that different techniques must be used to handle the data efficiently.
Here we suppose our big data (say, a large number of records) is stored in a file. Data is read from and written to disk in chunks called blocks (also known as pages or allocation units), and the disk drive always reads or writes at least one block at a time. The block size is the number of bytes per block; for our purposes a block can be as large as your main memory can afford. We can divide the file into blocks and read only the block we want into main memory. The remaining problem is how to find that block quickly.
Problem: How can you find the block quickly?
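As a rough illustration of the divide-and-merge idea, here is a minimal Python sketch. It assumes a text file holding one integer per line; the function name `chunked_sum` and the parameter `max_lines_per_chunk` are hypothetical names chosen for this example, not part of any library.

```python
def chunked_sum(path, max_lines_per_chunk):
    """Sum one integer per line without holding the whole file in memory."""
    partial_sums = []   # one small result per chunk
    chunk = []          # at most max_lines_per_chunk values are held at a time
    with open(path, "r") as f:
        for line in f:  # the file object yields lines lazily
            chunk.append(int(line))
            if len(chunk) == max_lines_per_chunk:
                partial_sums.append(sum(chunk))  # solve one small problem
                chunk = []
    if chunk:           # the last chunk may be only partly filled
        partial_sums.append(sum(chunk))
    return sum(partial_sums)  # merge the partial results
```

Only one chunk is in memory at any moment, so the chunk size can be tuned to whatever main memory allows.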
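Reading a single block by its number can be sketched as follows. This is only an illustration: the 4096-byte `BLOCK_SIZE` and the helper name `read_block` are assumptions made for the example, and the file is treated as a plain sequence of fixed-size blocks.

```python
BLOCK_SIZE = 4096  # assumed block size in bytes

def read_block(path, block_number, block_size=BLOCK_SIZE):
    """Return the bytes of one block; blocks are numbered from 0."""
    with open(path, "rb") as f:
        f.seek(block_number * block_size)  # jump straight to the block's offset
        return f.read(block_size)          # may be shorter for the last block
```

Because the offset is computed from the block number, no earlier block has to be read to reach the one we want.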
Solution:
We must keep in mind that the time to access a block is much larger than the time for any internal processing of data in main memory, so the overriding consideration in devising an external storage strategy is minimizing the number of block accesses. The usual techniques for this problem are hashing, indexing, and B-trees.
1 Hashing and External Storage
The central feature of external hashing is a hash table containing block numbers, which refer to blocks in external storage. The hash table is sometimes called an index (in the sense of a book's index). It can be stored in main memory or, if it is too large, stored externally on disk, with only part of it being read into main memory at a time.
1) First, all records with keys that hash to the same value are located in the same block.
2) Second, to find a record with a particular key, the search algorithm hashes the key, uses the hash value as an index into the hash table, gets the block number at that index, and reads the block.
To implement this scheme, we must choose the hash function and the size of the hash table with some care so that a limited number of keys hash to the same value.
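The scheme described above can be sketched in Python roughly as follows. Everything here is an assumption made for illustration: the byte-sum hash function, the tiny `BLOCK_SIZE` and `TABLE_SIZE`, the JSON encoding of a block, and the helper names. For brevity `build` groups the records in memory first, which a real big-data loader would not do.

```python
import json

BLOCK_SIZE = 256   # assumed bytes per block
TABLE_SIZE = 8     # assumed number of slots in the hash table

def hash_key(key):
    # Simple deterministic hash; Python's built-in hash() of a str
    # changes between runs, so it is not suitable for data on disk.
    return sum(key.encode("utf-8")) % TABLE_SIZE

def write_block(f, block_number, records):
    data = json.dumps(records).encode("utf-8")
    if len(data) > BLOCK_SIZE:
        raise ValueError("block is full")
    f.seek(block_number * BLOCK_SIZE)
    f.write(data.ljust(BLOCK_SIZE, b" "))  # pad to a full block

def read_block(f, block_number):
    f.seek(block_number * BLOCK_SIZE)
    data = f.read(BLOCK_SIZE)
    return json.loads(data.decode("utf-8")) if data.strip() else {}

def build(path, items):
    """Write one block per hash value; return the in-memory hash table."""
    buckets = {}
    for key, value in items:
        buckets.setdefault(hash_key(key), {})[key] = value
    hash_table = [None] * TABLE_SIZE  # slot -> block number
    with open(path, "wb") as f:
        for block_number, (slot, records) in enumerate(sorted(buckets.items())):
            write_block(f, block_number, records)
            hash_table[slot] = block_number
    return hash_table

def lookup(path, hash_table, key):
    block_number = hash_table[hash_key(key)]  # hash, then index the table
    if block_number is None:
        return None
    with open(path, "rb") as f:
        records = read_block(f, block_number)  # exactly one block access
    return records.get(key)
```

A lookup costs one hash computation, one table access in memory, and a single block read, which is the point of the technique.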
For example:
We can put all the blocks in one directory and use the hash values as the block file names, so a block file can be found by its name. For instance, if your search key's hash value is 2, you can find the file 2.txt and read it into main memory, because all the keys with the same hash value are in the same block.
You may be confused by 11.txt in the figure above. 11.txt is the overflow block file of 1.txt, used when 1.txt is full. This is the separate chaining method for handling full blocks; of course, you can use other methods to find the overflow blocks. In separate chaining, special overflow blocks are made available; when a primary block is found to be full, the new record is inserted in the overflow block.
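A small Python sketch of block files with overflow blocks might look like this. The names, the capacity of two records per block, and the tab-separated record format are all assumptions for illustration. Note one deliberate difference from the naming above: overflow files are called `1_1.txt`, `1_2.txt`, and so on rather than `11.txt`, so that an overflow file cannot be mistaken for the primary block of hash value 11.

```python
import os

MAX_RECORDS_PER_BLOCK = 2  # assumed (tiny) block capacity
TABLE_SIZE = 4             # assumed number of distinct hash values

def hash_key(key):
    return sum(key.encode("utf-8")) % TABLE_SIZE

def block_path(directory, hash_value, overflow_index):
    # Primary block: "1.txt"; overflow blocks: "1_1.txt", "1_2.txt", ...
    if overflow_index == 0:
        name = f"{hash_value}.txt"
    else:
        name = f"{hash_value}_{overflow_index}.txt"
    return os.path.join(directory, name)

def read_records(path):
    # One record per line, "key<TAB>value"; keys must not contain tabs.
    with open(path, "r", encoding="utf-8") as f:
        return [line.rstrip("\n").split("\t", 1) for line in f if line.strip()]

def insert(directory, key, value):
    """Append the record to the first block in the chain that has room."""
    h = hash_key(key)
    overflow_index = 0
    while True:
        path = block_path(directory, h, overflow_index)
        if not os.path.exists(path) or len(read_records(path)) < MAX_RECORDS_PER_BLOCK:
            with open(path, "a", encoding="utf-8") as f:
                f.write(f"{key}\t{value}\n")
            return path
        overflow_index += 1  # block is full: try the next overflow block

def find(directory, key):
    """Search the primary block, then each overflow block in turn."""
    h = hash_key(key)
    overflow_index = 0
    while True:
        path = block_path(directory, h, overflow_index)
        if not os.path.exists(path):
            return None  # end of the chain, key not present
        for k, v in read_records(path):
            if k == key:
                return v
        overflow_index += 1
```

Each overflow block costs one more block access during a search, so chains should stay short; this is another reason to choose the hash function and table size carefully.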