Finding the Median of Massive Data

Source: Internet. Editor: 程序博客网. Time: 2024/05/21 10:41

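There are tens of billions of integers, stored in a distributed fashion across several hundred networked computers. Can you design an algorithm and a system that finds the median of these tens of billions of values, that is, the value in the middle once the data is sorted? Obviously no single machine can hold all of the data, and network bandwidth should be used as sparingly as possible.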

m1:

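Divide the range 0 to 2^32 into a number of buckets, one bucket per computer. Each bucket has a counter recording how many numbers fall into it.

Starting from the first bucket, add up the counters until the running total exceeds n/2, where n is the total number of integers.

The median must lie in the bucket where that happens. Its rank inside that bucket is n/2 minus the total of all earlier buckets, so the exact value can be found by repeating the procedure on that bucket alone, or by sorting the bucket if it now fits in memory.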
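The bucket-counting idea above can be sketched as follows. This is only a minimal illustration under stated assumptions: the function names `local_histogram` and `find_median_bucket` are hypothetical, the shards are plain in-memory lists standing in for machines, the number of buckets is assumed to divide 2^bits evenly, and for an even n it targets the lower of the two middle values. Only the per-bucket counts would need to cross the network.

```python
def local_histogram(shard, num_buckets, bits=32):
    """Count how many values of one machine's shard fall into each bucket."""
    width = (1 << bits) // num_buckets  # assumes num_buckets divides 2**bits
    counts = [0] * num_buckets
    for x in shard:
        counts[x // width] += 1
    return counts


def find_median_bucket(shards, num_buckets, bits=32):
    """Return (bucket index, 0-based rank of the median inside that bucket)."""
    totals = [0] * num_buckets
    # In a real system each shard would be counted on its own machine and
    # only the small count arrays would be sent to a coordinator.
    for shard in shards:
        for i, c in enumerate(local_histogram(shard, num_buckets, bits)):
            totals[i] += c
    n = sum(totals)
    if n == 0:
        raise ValueError("no data")
    k = (n - 1) // 2  # 0-based rank of the (lower) median
    running = 0
    for i, c in enumerate(totals):
        if running + c > k:  # cumulative count has passed the median rank
            return i, k - running
        running += c


if __name__ == "__main__":
    # Toy example: 4-bit values, 4 buckets of width 4.
    print(find_median_bucket([[1, 2, 3, 4], [5, 6], [7, 8, 9]], 4, bits=4))
```

The returned rank says which element of the winning bucket is the median, so a second pass only has to look at values inside that one bucket.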

m2:

http://matpalm.com/median/distributing.html

http://matpalm.com/median/erlang_multi.html

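This uses the idea behind quicksort, but only counts elements relative to the pivot instead of swapping them, which saves the cost of moving data.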
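A minimal sketch of counting-based selection, written in Python rather than the Erlang of the linked pages. It is not the linked implementation: the names `count_around`, `pick_pivot` and `kth_smallest` are hypothetical, and the pivot is simply the first value still inside the candidate range. The data is never reordered; each round only counts how many values are below or equal to the pivot and then tightens the bounds.

```python
def count_around(shard, pivot):
    """Count values below the pivot and values equal to it in one shard."""
    less = sum(1 for x in shard if x < pivot)
    equal = sum(1 for x in shard if x == pivot)
    return less, equal


def pick_pivot(shards, lo, hi):
    """Return the first value strictly between lo and hi (None means unbounded)."""
    for shard in shards:
        for x in shard:
            if (lo is None or x > lo) and (hi is None or x < hi):
                return x
    return None


def kth_smallest(shards, k):
    """Find the value of 0-based rank k without moving any element."""
    if k < 0:
        raise ValueError("k must be non-negative")
    lo = hi = None  # exclusive bounds on where the answer can still be
    while True:
        pivot = pick_pivot(shards, lo, hi)
        if pivot is None:
            raise ValueError("k is out of range")
        less = equal = 0
        for shard in shards:  # each shard could be counted on its own machine
            l, e = count_around(shard, pivot)
            less += l
            equal += e
        if k < less:
            hi = pivot          # answer is smaller than the pivot
        elif k < less + equal:
            return pivot        # the pivot itself has rank k
        else:
            lo = pivot          # answer is larger than the pivot


if __name__ == "__main__":
    shards = [[1, 2, 3, 4], [5, 6], [7, 8, 9]]
    n = sum(len(s) for s in shards)
    print(kth_smallest(shards, (n - 1) // 2))
```

Taking the first candidate as pivot keeps the sketch short but can need many rounds on sorted input; a random or sampled pivot would be the usual refinement.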

the multi process implementation splits the list of numbers to consider into sub lists
each sub list can be handled by a separate erlang process, potentially across different machines
(these lists don't have to be the same size)

single [1,2,3,4,5,6,7,8,9]
multi  [[1,2,3,4],[5,6],[7,8,9]]

recall there are a number of operations required in the distributed case
these include things like determining the total number of elements, determining the minimum value, etc
each of these operations can be done against each sub list individually with the results aggregated

eg to determine total number of elements

single -> length( [1,2,3,4,5,6,7,8,9] ) = 9
multi  -> sum ( length([1,2,3,4]), length([5,6]), length([7,8,9]) ) = sum([4,2,3]) = 9

eg to determine minimum value

single -> min( [1,2,3,4,5,6,7,8,9] ) = 1
multi  -> min( min([1,2,3,4]), min([5,6]), min([7,8,9]) ) = min([1,5,7]) = 1
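The two aggregations above can be written out in a few lines. This is a sketch in Python rather than Erlang, with hypothetical helper names `total_length` and `global_min`; in a distributed setting the inner `len` and `min` calls would run on whichever process holds each sub list, and only one number per sub list would be sent back.

```python
def total_length(sublists):
    """Sum of the per-sub-list lengths."""
    return sum(len(s) for s in sublists)


def global_min(sublists):
    """Minimum of the per-sub-list minimums (empty sub lists are skipped)."""
    return min(min(s) for s in sublists if s)


if __name__ == "__main__":
    multi = [[1, 2, 3, 4], [5, 6], [7, 8, 9]]
    print(total_length(multi), global_min(multi))
```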

other changes required

rotation

in the single list case we pick the pivot as the first value.
in the multi list case we pick the pivot as the first value of the first list.
recall in the algorithm that rotation is sometimes required.
this is to ensure all potential values for a pivot are explored.
so in the multi list case the rotation needs to operate at two levels; rotate the first list and then rotate the list of lists

before [[1,2,3],[4,5,6],[7,8,9]]
after  [[4,5,6],[7,8,9],[2,3,1]]
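The two-level rotation can be sketched like this, again in Python rather than Erlang and with hypothetical names `rotate` and `rotate_multi`: the first sub list is rotated by one element, then that sub list is moved to the end of the list of lists.

```python
def rotate(items):
    """Move the first element of a list to the end."""
    return items[1:] + items[:1]


def rotate_multi(sublists):
    """Rotate the first sub list, then rotate the list of sub lists."""
    if not sublists:
        return []
    return sublists[1:] + [rotate(sublists[0])]


if __name__ == "__main__":
    print(rotate_multi([[1, 2, 3], [4, 5, 6], [7, 8, 9]]))
```

Applied repeatedly, this brings every element of every sub list to the pivot position (first value of the first list) in turn.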

Given 1T (10^12) int64 integers and only 1 GB (10^9 bytes) of memory, how would you design an algorithm and a program to find their median?

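Big-data problems like this are all handled by splitting the data into blocks. An int64 is 64 bits = 8 bytes, so, taking 1 GB as 2^30 bytes, 1 GB of memory can hold at most 2^30 / 8 = 2^27 numbers.
1. Divide the value range into M = 2^27 - 1 blocks, one 8-byte counter per block, each block covering range = 2^64 / M values (rounded up).
2. Create an array A of size M. A[0] counts the numbers in 0 to range-1, A[1] those in range to 2*range-1, and so on (for signed values, measure from -2^63 instead of 0).
3. Scan the data once from the beginning and increment the element of A whose block contains each number.
4. Add up A from the start until the running total passes half the count; the median lies in that block, which narrows the range. Then repeat the same scan on that block alone to keep narrowing the range until the median is pinned down.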
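The repeated narrowing can be sketched as below. Everything here is an assumption for illustration: the function name `median_by_refinement`, the `scan` callable that returns a fresh iterator over the data on each call (standing in for re-reading the file from disk), and the half-open value range `[lo, hi)`. For an even count it returns the lower of the two middle values. With `num_buckets` around 2^27 and a 64-bit range, two passes over the data would be enough; the demo uses tiny numbers.

```python
def median_by_refinement(scan, n, lo, hi, num_buckets):
    """Find the (lower) median of n values in [lo, hi) using repeated scans.

    Only num_buckets counters are kept in memory at any time.
    """
    if n <= 0:
        raise ValueError("no data")
    if num_buckets < 2:
        raise ValueError("need at least two buckets")
    k = (n - 1) // 2  # 0-based rank of the median within [lo, hi)
    while hi - lo > 1:
        width = -(-(hi - lo) // num_buckets)  # ceiling division
        counts = [0] * num_buckets
        for x in scan():  # one full pass over the data
            if lo <= x < hi:
                counts[(x - lo) // width] += 1
        running = 0
        for i, c in enumerate(counts):
            if running + c > k:
                k -= running  # rank relative to the chosen block
                lo, hi = lo + i * width, min(hi, lo + (i + 1) * width)
                break
            running += c
        else:
            raise ValueError("n does not match the data in [lo, hi)")
    return lo


if __name__ == "__main__":
    data = [7, 1, 9, 3, 5, 8, 2, 6, 4]
    print(median_by_refinement(lambda: iter(data), len(data), 0, 16, 4))
```

For real int64 input the call would use `lo = -2**63` and `hi = 2**63`, which Python's arbitrary-precision integers handle without overflow.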