Finding the Most Frequent IP in a Massive Log: a Python Implementation


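I recently read the article "教你如何迅速秒杀掉：99%的海量数据处理面试题" (roughly, "How to quickly crack 99% of massive-data interview questions"). Its first problem is this: given a massive amount of log data, extract the IP that visited Baidu the most times on a given day. In this post I solve that problem with my own approach.

Imagine that all identical IPs in the log file were adjacent to each other. Wouldn't a single scan of the file then be enough to find the one that occurs most often? That is the idea behind this post.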
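To make the idea concrete, here is a minimal sketch in Python 3 (the scripts later in this post are Python 2). The helper name `most_frequent_adjacent` is my own and hypothetical; it only assumes that equal items sit next to each other in the input.

```python
from itertools import groupby

def most_frequent_adjacent(items):
    """Return (item, count) for the longest run of equal adjacent items.

    Assumes equal items are adjacent, e.g. because the input is sorted.
    Returns (None, 0) for an empty input.
    """
    best_item, best_count = None, 0
    for item, run in groupby(items):      # groupby yields one group per run
        count = sum(1 for _ in run)       # length of this run
        if count > best_count:
            best_item, best_count = item, count
    return best_item, best_count

ips = ["1.1.1.1", "1.1.1.1", "2.2.2.2", "3.3.3.3", "3.3.3.3", "3.3.3.3"]
print(most_frequent_adjacent(ips))
```

Only one run length is held in memory at a time, which is why a single pass over a sorted file is enough.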

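Sorting happens to be a good way to make identical IPs adjacent. Sorting requires comparison, but an IP is a string such as "188.62.136.28", so how do we compare two of them? In Python, strings can be compared directly. The rule is: the string whose first character is larger is the larger string; if the first characters are equal, the following characters are compared in turn. So there is no need to care where the '.' characters sit in the IP; we simply follow Python's own rules. For example, "9.131.255.66" > "255.255.255.255" is perfectly normal under this ordering. The order is not numeric, but that does not matter here, because all we need is for equal IPs to end up next to each other.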
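A quick check of this comparison rule, written for Python 3 (string comparison behaves the same way in Python 2). The sample list is made up for illustration.

```python
ips = ["9.131.255.66", "255.255.255.255", "188.62.136.28", "9.131.255.66"]

# Character-by-character comparison: '9' > '2', so the first string is "larger".
print("9.131.255.66" > "255.255.255.255")

# The resulting order is lexicographic, not numeric,
# but equal IPs still end up adjacent after sorting.
ordered = sorted(ips)
print(ordered)
```

The two copies of "9.131.255.66" land next to each other at the end of the sorted list, which is the only property the rest of the approach relies on.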

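At this point, the essence of the problem is sorting a large data set. When the data is too large for the available memory, the only option is to break it into small pieces, handle each piece separately, and then merge the results. In this example, to save time, I use a file containing only 1,000,000 IPs.

In this post the log file is split into equal-sized parts, each small file is then sorted in memory, and finally the sorted files are merged with a loser tree. Replacement-selection sorting could be used to reduce the number of small files produced; the next blog post implements it that way.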
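The split, sort, and merge pipeline described above can be sketched compactly in Python 3. This is only an illustration under assumptions: the function name `sort_in_blocks` and its parameters are hypothetical, every line is assumed to end with a newline, and the standard library's `heapq.merge` (a heap-based k-way merge) stands in for the loser tree used by the scripts in this post.

```python
import heapq
import os
import tempfile
from itertools import islice

def sort_in_blocks(src_path, dst_path, block_size):
    """External sort: write sorted runs of block_size lines, then k-way merge them."""
    run_paths = []
    with open(src_path) as src:
        while True:
            block = list(islice(src, block_size))   # read at most block_size lines
            if not block:
                break
            block.sort()                             # internal sort of one block
            fd, path = tempfile.mkstemp(suffix=".txt")
            with os.fdopen(fd, "w") as run:
                run.writelines(block)
            run_paths.append(path)

    runs = [open(p) for p in run_paths]
    try:
        with open(dst_path, "w") as dst:
            # heapq.merge reads the runs lazily, one line per run at a time.
            dst.writelines(heapq.merge(*runs))
    finally:
        for f in runs:
            f.close()
        for p in run_paths:
            os.remove(p)
```

Memory use is bounded by `block_size` lines during the first phase and by one line per run during the merge, which is the point of the divide-and-merge approach.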

The complete implementation follows:

1. Generating the log file: MakeIPs.py

Code:

#!/usr/bin/python
# Filename: MakeIPs.py
__author__ = 'ihippy'
__email__ = 'ihippy@163.com'
__date__ = '$Date: 2013.05.08 $'

import random

def makeRandom(firstNum, lastNum):
    return random.randint(firstNum, lastNum)

def makeIP(filePath, numberOfLines):
    '''Append numberOfLines random IPs, one per line, to filePath.'''
    try:
        IP = []
        file_handler = open(filePath, 'a+')
        for i in range(numberOfLines):
            IP.append(str(makeRandom(0, 255)) + '.' + str(makeRandom(0, 255)) + '.' + str(makeRandom(0, 255)) + '.' + str(makeRandom(0, 255)) + '\n')
        file_handler.writelines(IP)
        file_handler.close()
    # File errors raise IOError (the original caught EOFError, which never fires here).
    except IOError:
        print 'Operate Failed!'

if __name__ == '__main__':
    import sys
    try:
        filePath = sys.argv[1]
        lineNum = int(sys.argv[2])
    except (IndexError, ValueError):
        print 'Wrong Arguments!'
        print '''You need 2 Parameters in total.
1. The path of the target file.
2. The Number of lines in the target file.
You Should do like this:
python MakeIPs.py /root/hehe/file 1000000'''
        sys.exit()
    from time import ctime
    print 'The time now is: ',
    print ctime()
    print 'Start...'
    # Generate at most 1,000,000 lines per call to keep the in-memory list small.
    if lineNum > 1000000:
        a = lineNum / 1000000
        b = lineNum % 1000000
        for i in range(0, a):
            makeIP(filePath, 1000000)
        makeIP(filePath, b)
    else:
        makeIP(filePath, lineNum)
    print 'Work Done, and the time now is: ',
    print ctime()

[Screenshot: running MakeIPs.py]

2. Splitting: SplitFile.py

Code:

#!/usr/bin/python
# Filename: SplitFile.py
__author__ = 'ihippy'
__email__ = 'ihippy@163.com'
__date__ = '$Date: 2013.05.08 $'

# Imported at module level because splitFile uses ctime.
from time import ctime

def splitFile(fileLocation, targetFoler, blockSize):
    '''Split Big File in fileLocation to little ones to targetFoler'''
    file_handler = open(fileLocation, 'r')
    line = file_handler.readline()
    countFile = 1
    while line:
        temp = []
        # Collect up to blockSize lines; the last block may be shorter.
        while line and len(temp) < blockSize:
            temp.append(line)
            line = file_handler.readline()
        file_writer = open(targetFoler + '/file_' + str(countFile) + '.txt', 'w')
        file_writer.writelines(temp)
        file_writer.close()
        print 'file' + str(countFile) + ' created at:' + str(ctime())
        countFile += 1
    file_handler.close()

if __name__ == '__main__':
    import sys
    try:
        fileLocation = sys.argv[1]
        targetFoler = sys.argv[2]
        blockSize = int(sys.argv[3])
    except (IndexError, ValueError):
        print 'Wrong Arguments!'
        print '''You need 3 Parameters in total.
1. The path of your file.
2. The path of the target files.
3. The number of lines of the little file you want to split to.
You should do like this:
python SplitFile.py /root/hehe.txt /root/hehe/ 100'''
        sys.exit()
    print 'The time now is: ',
    print ctime()
    print 'Start...'
    splitFile(fileLocation, targetFoler, blockSize)
    print 'Work Done, and the time now is: ',
    print ctime()

[Screenshot: running SplitFile.py]

[Screenshot: the resulting files]

3. Internal sort (heap sort is used here): SortIPs.py

Code:

#!/usr/bin/python
# FileName: SortIPs.py

# If the node has only one child
def changeTwoIPs(ipList, i):
    if ipList[i] > ipList[i*2+1]:
        ipList[i], ipList[i*2+1] = ipList[i*2+1], ipList[i]

# If the node has two children: swap with the smaller child when needed
# and return that child's index, or None if no swap happened.
def changeThreeIPs(ipList, i):
    if ipList[i] > ipList[i*2+1] and ipList[i] > ipList[i*2+2]:
        if ipList[i*2+1] > ipList[i*2+2]:
            ipList[i], ipList[i*2+2] = ipList[i*2+2], ipList[i]
            return i*2+2
        else:
            ipList[i], ipList[i*2+1] = ipList[i*2+1], ipList[i]
            return i*2+1
    elif ipList[i] > ipList[i*2+1] and ipList[i] <= ipList[i*2+2]:
        ipList[i], ipList[i*2+1] = ipList[i*2+1], ipList[i]
        return i*2+1
    elif ipList[i] <= ipList[i*2+1] and ipList[i] > ipList[i*2+2]:
        ipList[i], ipList[i*2+2] = ipList[i*2+2], ipList[i]
        return i*2+2
    return None

# From the node search downward, until reach the last parent node.
def adjustHeap(ipList, i):
    s = len(ipList)
    last = s / 2 - 1
    while i is not None and i < last:
        i = changeThreeIPs(ipList, i)
    if i == last:
        if s % 2 == 0:
            changeTwoIPs(ipList, last)    # last has one child.
        else:
            changeThreeIPs(ipList, last)  # last has two children

def heapSortIPs(ipList):
    '''Sort the IPs from small to large by string comparison.'''
    # Build a min-heap, starting from the last parent node.
    i = len(ipList) / 2 - 1
    while i >= 0:
        adjustHeap(ipList, i)
        i -= 1
    # Repeatedly move the minimum (the root) to the end, pop it, and re-adjust.
    b = []
    n = len(ipList)
    for i in range(n):
        ipList[0], ipList[len(ipList)-1] = ipList[len(ipList)-1], ipList[0]
        b.append(ipList.pop(len(ipList)-1))
        adjustHeap(ipList, 0)
    return b

#---------------------------Test------------------------------
if __name__ == '__main__':
    from time import ctime
    import sys
    try:
        filePath = sys.argv[1]
        num1 = int(sys.argv[2])
        num2 = int(sys.argv[3])
    except (IndexError, ValueError):
        print 'Wrong Arguments!'
        print '''You need 3 Parameters in total.
1. The path of your files.
2. The Number of the first file's name contains.
3. The Number of the last file's name contains.
You Should do like this:
python SortIPs.py /root/hehe/file_ 1 5'''
        sys.exit()
    print 'Now the time is ' + str(ctime()) + ','
    print 'and the work is coming, please wait...'
    # Sort each small file in place.
    for i in range(num1, num2 + 1):
        a = []
        file_handler = open(filePath+str(i)+'.txt', 'r')
        line = file_handler.readline()
        while line:
            a.append(line)
            line = file_handler.readline()
        file_handler.close()
        a = heapSortIPs(a)
        file_writer = open(filePath+str(i)+'.txt', 'w')
        file_writer.writelines(a)
        file_writer.close()
    print 'Work Over!'
    print 'Now the time is ' + str(ctime())

[Screenshot: running SortIPs.py]

[Screenshot: the sorted files]

4. Loser-tree merge: LoserTree.py and MergerFile.py

Loser tree code:

#!/usr/bin/python
# Filename: LoserTree.py

def createLoserTree(loserTree, dataArray, n):
    '''Initialize the loser tree and data array by the branch number n.
    Assign all members of the loser tree and the data array,
    and adjust them into a real loser tree.'''
    # The placeholders -n..-1 are ints. Under Python 2 any int compares
    # less than any string, so they sort before every real IP line.
    for i in range(n):
        loserTree.append(0)
        dataArray.append(i-n)
    for i in range(n):
        adjust(loserTree, dataArray, n, n-1-i)

# Unlike the HeapSort, the LoserTree adjust from bottom to top.
def adjust(loserTree, dataArray, n, s):
    t = (s + n) / 2
    while t > 0:
        # The larger value loses and stays in the node; the winner moves up.
        if dataArray[s] > dataArray[loserTree[t]]:
            s, loserTree[t] = loserTree[t], s
        t /= 2
    loserTree[0] = s

#---------------------------Test------------------------------
if __name__ == '__main__':
    import random
    a = 10
    loserTree = []
    dataArray = []
    createLoserTree(loserTree, dataArray, a)
    print 'At first, the loser tree and the data array are at below.'
    print 'Loser Tree:'
    print loserTree
    print 'Data Array:'
    print dataArray
    # Adjust the loserTree every time change one item of the dataArray.
    for i in range(a):
        dataArray[i] = random.randint(0, 500)
        adjust(loserTree, dataArray, a, i)
    print '\nAfter change the data array is:'
    print dataArray
    print 'And the loser tree is:'
    print loserTree
    print 'The least number now is the dataArray[%d] and it is %d.' % (loserTree[0], dataArray[loserTree[0]])
    print '\nChange the %d number to a random number between 0 and 500.' % (loserTree[0]+1)
    dataArray[loserTree[0]] = random.randint(0, 500)
    print 'Now the data array is:'
    print dataArray
    print 'Adjust it...'
    adjust(loserTree, dataArray, a, loserTree[0])
    print 'Now the loser tree is:'
    print loserTree
    print 'The new least number now is the dataArray[%d] and it is %d.' % (loserTree[0], dataArray[loserTree[0]])

Merge code:

#!/usr/bin/python
# Filename : MergerFile.py
__author__ = 'ihippy'
__email__ = 'ihippy@163.com'
__date__ = '$Date: 2013.05.08 $'

# A method to write items in the array into file.
def writeFile(tarDir, tmp):
    file_writer = open(tarDir, 'a+')
    file_writer.writelines(tmp)
    file_writer.close()

if __name__ == '__main__':
    from time import ctime
    import sys
    import os
    from LoserTree import createLoserTree, adjust
    try:
        fileLocation = sys.argv[1]
        fileNum = int(sys.argv[2])
        a = int(sys.argv[3])
    except (IndexError, ValueError):
        print 'Wrong Arguments!'
        print '''You need 3 Parameters in total.
1. The path of your files.
2. The Number of the first file's name contains.
3. The size of the Loser Tree.
You Should do like this:
python MergerFile.py /root/hehe/file_ 1 5'''
        sys.exit()
    print 'Now the time is ' + str(ctime()) + ','
    print 'and the work is coming, please wait...'
    loserTree = []
    dataArray = []
    createLoserTree(loserTree, dataArray, a)
    # This array is used to store file readers
    file_reader = []
    for i in range(a):
        try:
            file_reader.append(open(fileLocation+str(i+fileNum)+'.txt', 'r'))
            # An empty file is treated like a missing one.
            dataArray[i] = file_reader[i].readline() or 'F'
            adjust(loserTree, dataArray, a, i)
        # If failed read file, make the corresponding variable to the char 'F'.
        # 'F' compares greater than any line that starts with a digit.
        except (IOError, IndexError):
            dataArray[i] = 'F'
            adjust(loserTree, dataArray, a, i)
    if dataArray[loserTree[0]] == 'F':
        print 'No files.'
        sys.exit()
    # A temporary array to store sorted ips.
    tmp = []
    while dataArray[loserTree[0]] != 'F':
        tmp.append(dataArray[loserTree[0]])
        try:
            line = file_reader[loserTree[0]].readline()
            if line:
                dataArray[loserTree[0]] = line
            # Reach the end of the file.
            else:
                dataArray[loserTree[0]] = 'F'
            adjust(loserTree, dataArray, a, loserTree[0])
        except (IOError, IndexError):
            # Mark the current winner as finished (the original wrote to dataArray[i]).
            dataArray[loserTree[0]] = 'F'
            adjust(loserTree, dataArray, a, loserTree[0])
        # If the number of items in tmp reaches 1000000, write them into file.
        if len(tmp) == 1000000:
            writeFile(fileLocation + 'A' + str((fileNum-1)/a+1) + '.txt', tmp)
            tmp = []
    # If tmp isn't empty, write them to file.
    writeFile(fileLocation + 'A' + str((fileNum-1)/a+1) + '.txt', tmp)
    tmp = []
    while file_reader:
        file_reader.pop().close()
    # Remove old files.
    for i in range(a):
        command = 'rm -rf ' + fileLocation + str(i+fileNum) + '.txt'
        if os.system(command) == 0:
            print 'Remove old File Success!'
        else:
            print 'Failed!!!'
    print 'Work Over!'
    print 'Now the time is ' + str(ctime())

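[Screenshot: running MergerFile.py]

Sorting complete:

Below is a screenshot comparison of the file before and after sorting. To verify data integrity, we can compare the byte counts before and after sorting; as the screenshot shows, the sorted file is complete.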
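The same check can be scripted. The sketch below is written for Python 3 and the helper names `same_size` and `is_sorted` are hypothetical. Note that equal byte counts are a necessary condition rather than proof that no line was lost or altered, so the sketch also checks that the output really is in order.

```python
import os

def same_size(path_a, path_b):
    """True if both files have the same number of bytes."""
    return os.path.getsize(path_a) == os.path.getsize(path_b)

def is_sorted(path):
    """True if every line is >= the previous line (string comparison)."""
    prev = None
    with open(path) as f:
        for line in f:
            if prev is not None and line < prev:
                return False
            prev = line
    return True
```

For example, with hypothetical paths, `same_size('ips.txt', 'sorted.txt') and is_sorted('sorted.txt')` should be true after a successful run.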

5. The last step: FindTheIP.py

Code:

#!/usr/bin/python
# Filename: FindTheIP.py
import sys

if __name__ == '__main__':
    try:
        path = sys.argv[1]
    except IndexError:
        print 'Input the path of the file.'
        sys.exit()
    try:
        file_reader = open(path, 'r')
    except IOError:
        print 'Exception occurred. Maybe the path you just input is invalid.'
        sys.exit()
    line = file_reader.readline()
    ip = line
    num = 0
    maxNum = 0
    maxIP = line
    print 'Calculating...'
    # The file is sorted, so equal IPs are adjacent: count the length of each run.
    while line:
        if line == ip:
            num += 1
        else:
            # A new run starts; this line is its first occurrence.
            ip = line
            num = 1
        if num > maxNum:
            maxNum = num
            maxIP = ip
        line = file_reader.readline()
    file_reader.close()
    print 'IP %s is the most, its number is %d' % (maxIP.strip(), maxNum)

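[Screenshot: running FindTheIP.py and its result]

Of course, because the data set is small (only 1,000,000 entries) and the IPs are randomly generated, the most frequent IP appears only 2 times, which is to be expected.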
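A rough back-of-the-envelope estimate supports this. Assuming each of the four octets is drawn uniformly from 0..255, as MakeIPs.py does, there are 256^4 possible IPs. The Python 3 sketch below (`math.comb` needs Python 3.8 or later) computes the expected number of pairs and triples of lines that share an IP; the variable names are my own.

```python
from math import comb

n = 1000000            # lines in the test file
space = 256 ** 4       # possible IPs when each octet is uniform in 0..255

# Expected number of pairs (and triples) of lines that carry the same IP.
expected_pairs = comb(n, 2) / space
expected_triples = comb(n, 3) / space ** 2
print(expected_pairs, expected_triples)
```

This gives roughly 116 expected pairs but only about 0.009 expected triples, so a handful of IPs appearing twice and none appearing three times is the typical outcome for this file.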

And with that, the job is done!