CareerCup Find the usernames existing in both documents
Source: Internet  Editor: 程序博客网  Time: 2024/06/06 08:41
Given two log files, each containing a billion usernames (each username is appended to the log file), find the usernames that appear in both files.
------------------------------------------------------------------------------------------------------
Two approaches:
1. Build a hash set from the smaller file and check every username in the other file against it. Time O(n), space O(n).
2. Sort both files (O(n log n)), then find the matches with a two-pointer scan (O(n)). Total complexity O(n log n) + O(n) = O(n log n).
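As a minimal sketch of the two approaches, assuming the usernames are small enough to hold in Python lists (which is not true for a billion entries, so this only illustrates the logic); the function names `common_by_hash` and `common_by_sort` are made up for this example:

```python
def common_by_hash(small, large):
    """Approach 1: build a hash set from the smaller input, stream the other."""
    seen = set(small)
    return {name for name in large if name in seen}


def common_by_sort(a, b):
    """Approach 2: sort both inputs, then advance two pointers in step."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            i += 1
        elif a[i] > b[j]:
            j += 1
        else:
            # Equal values: record the match once, skipping repeated log entries.
            if not out or out[-1] != a[i]:
                out.append(a[i])
            i += 1
            j += 1
    return out
```

Both functions report each common username once, even if it was appended to a log several times.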
Similar Problem:
Given two large files A and B, each storing 5 billion URLs of 256 bytes each, and a memory limit of 4 GB, find the URLs that appear in both A and B.
Answer:
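Method 1: Use a BloomFilter (a duplicate-detection data structure similar to a hash table but using far less space). With K different hash functions, map the 5G (5 billion) URLs of one file onto 32G bits (4 GB, the entire memory budget). A URL from the other file is reported as a duplicate if and only if all K bit positions produced by the hash functions are 1. K is typically set to 8, although with 6.4 bits per URL the standard optimum K = (m/n) ln 2 is about 4.4, so a smaller K would give fewer false positives. This method loses precision: it can report URLs that are not actually in both files. Time complexity O(n).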
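The Bloom filter in Method 1 could be sketched roughly as follows. This is an illustration under assumptions, not the original author's code: the class `BloomFilter`, the use of SHA-256 with double hashing to derive the K positions, and the helper `false_positive_rate` are all choices made for this example.

```python
import hashlib
import math


class BloomFilter:
    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Assumption: derive the k positions by double hashing two 64-bit
        # halves of a single SHA-256 digest instead of k separate functions.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for p in self._positions(item):
            self.bits[p >> 3] |= 1 << (p & 7)

    def __contains__(self, item):
        # True only if every one of the k bits is set; may be a false positive.
        return all(self.bits[p >> 3] & (1 << (p & 7)) for p in self._positions(item))


def false_positive_rate(n_items, num_bits, num_hashes):
    """Standard approximation (1 - e^(-k*n/m))^k for a filled filter."""
    return (1 - math.exp(-num_hashes * n_items / num_bits)) ** num_hashes
```

In use, every URL of A would be passed to `add`, and then each URL of B tested with `in`. Under the usual approximation, `false_positive_rate(5e9, 32e9, 8)` evaluates to roughly 0.067, and the same call with 4 hash functions to roughly 0.047.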
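Method 2: Use one hash function to scatter the 5G URLs of A into 5*256/4 = 320 files (A0, A1, ...), so that all URLs in the same file have the same value of hash % 320. Each file then averages 4 GB in size. Do the same for B (B0, B1, ...). Because identical URLs always land in buckets with the same index, it is enough to process the pairs Ai and Bi one after another, and each pair only needs a simple hash table that loads the URLs into memory. Since a 4 GB bucket plus hash-table overhead will not fit in 4 GB of memory, in practice the number of buckets would need to be larger than 320. This method gives an exact answer, unlike Method 1, but it is also slower: Method 1 needs only 10G read operations, while Method 2 needs 20G reads and 10G writes (not counting the writes for the output in either case).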
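The partition-then-match idea of Method 2 might look like the sketch below. It assumes one URL per line in each input file; the function names, the `A`/`B` bucket file prefixes, and the choice of `zlib.crc32` as the hash are assumptions for illustration only.

```python
import os
import tempfile
import zlib


def partition(path, out_dir, prefix, num_buckets):
    """Split a one-URL-per-line file into num_buckets files by hash value."""
    outs = [open(os.path.join(out_dir, f"{prefix}{i}"), "w", encoding="utf-8")
            for i in range(num_buckets)]
    try:
        with open(path, encoding="utf-8") as src:
            for line in src:
                url = line.rstrip("\n")
                if url:
                    # crc32 is stable across runs, unlike the built-in hash() for str.
                    bucket = zlib.crc32(url.encode("utf-8")) % num_buckets
                    outs[bucket].write(url + "\n")
    finally:
        for f in outs:
            f.close()


def common_urls(path_a, path_b, num_buckets=320):
    """Return URLs present in both files, holding one bucket of A in memory at a time."""
    common = set()  # at real scale the matches would be written to a file instead
    with tempfile.TemporaryDirectory() as tmp:
        partition(path_a, tmp, "A", num_buckets)
        partition(path_b, tmp, "B", num_buckets)
        for i in range(num_buckets):
            with open(os.path.join(tmp, f"A{i}"), encoding="utf-8") as fa:
                in_a = {line.rstrip("\n") for line in fa}
            with open(os.path.join(tmp, f"B{i}"), encoding="utf-8") as fb:
                common.update(u for u in (line.rstrip("\n") for line in fb) if u in in_a)
    return common
```

Each input is read once and written once during partitioning, then read once more during matching, which matches the read and write counts given above.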
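From the interviewer's perspective:
Answers to problems about processing large files in a small amount of memory mainly come from the following angles:
1. Consider different algorithms for exact results and for approximate results
2. Minimize file write operations
3. Use a BloomFilter
4. Use MapReduce
Approaching the answer from these 4 angles will rarely go wrong. For this class of problem, time complexity is no longer the main thing being tested.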
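For the MapReduce angle, the intersection can be expressed as a map step that tags each URL with its source file and a reduce step that keeps URLs carrying both tags. The following is a single-process sketch of that data flow only; the function names and the `"A"`/`"B"` tags are assumptions for illustration, and a real framework would run the shuffle across machines.

```python
from collections import defaultdict


def map_phase(source, urls):
    """Map: emit (url, source) so every URL is tagged with its file."""
    for url in urls:
        yield url, source


def shuffle(pairs):
    """Shuffle: group the source tags by URL."""
    groups = defaultdict(set)
    for url, source in pairs:
        groups[url].add(source)
    return groups


def reduce_phase(groups):
    """Reduce: keep only URLs that were seen in both files."""
    return sorted(url for url, sources in groups.items() if sources == {"A", "B"})
```

The shuffle step plays the same role as the hash partitioning in Method 2: it guarantees that all copies of a URL meet in one place.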