Reading a File in Chunks with Python Multiprocessing

I've been doing log analysis lately, and the wretched logs routinely run to several gigabytes. Processing them with threads tends to be slow, and in Python 2.x multithreading cannot really exploit multiple processors for truly concurrent execution (the GIL gets in the way). So I decided to read the file in chunks with multiple processes.

Someone online posted a reference implementation: http://www.oschina.net/code/snippet_97079_4465

# -*- coding: GBK -*-
"""Read a file in chunks with multiple processes."""
import datetime
import os
from multiprocessing import Process, Array, RLock

WORKERS = 4
BLOCKSIZE = 100000000  # bytes per chunk
FILE_SIZE = 0

def getFilesize(file):
    """Get the size of the file to be read."""
    global FILE_SIZE
    fstream = open(file, 'r')
    fstream.seek(0, os.SEEK_END)
    FILE_SIZE = fstream.tell()
    fstream.close()

def process_found(pid, array, file, rlock):
    """
    Worker process.

    Args:
        pid:   worker number
        array: shared array marking the end position of the block
               each worker has claimed
        file:  name of the file to read
        rlock: lock protecting the shared array

    Each worker takes the current maximum of array as its start position;
    the end position is startposition + BLOCKSIZE, capped at FILE_SIZE.
    If startposition == FILE_SIZE the worker is done.
    If startposition == 0 it reads from the very beginning.
    If startposition != 0, the first line may have been cut by the block
    boundary, so the worker discards one line and starts from the next.
    It keeps calling readline() while the current position is below the
    end position; past the boundary it claims the next block from array.
    """
    global FILE_SIZE
    fstream = open(file, 'r')
    while True:
        rlock.acquire()
        print 'pid%s' % pid, ','.join([str(v) for v in array])
        startposition = max(array)
        endposition = array[pid] = (startposition + BLOCKSIZE) if (startposition + BLOCKSIZE) < FILE_SIZE else FILE_SIZE
        rlock.release()
        if startposition == FILE_SIZE:  # end of the file
            print 'pid%s end' % pid
            break
        elif startposition != 0:
            # skip the line cut by the block boundary
            fstream.seek(startposition)
            fstream.readline()
        pos = ss = fstream.tell()
        ostream = open('/data/download/tmp_pid' + str(pid) + '_jobs' + str(endposition), 'w')
        while pos < endposition:
            # process the line here
            line = fstream.readline()
            ostream.write(line)
            pos = fstream.tell()
        print 'pid:%s,startposition:%s,endposition:%s,pos:%s' % (pid, ss, endposition, pos)
        ostream.flush()
        ostream.close()
    fstream.close()

def main():
    print datetime.datetime.now().strftime("%Y/%m/%d %H:%M:%S")
    file = "/data/pds/download/scmcc_log/tmp_format_2011004.log"
    getFilesize(file)
    print FILE_SIZE
    rlock = RLock()
    array = Array('l', WORKERS, lock=rlock)
    workers = []
    for i in range(WORKERS):
        p = Process(target=process_found, args=[i, array, file, rlock])
        workers.append(p)
    for p in workers:
        p.start()
    for p in workers:
        p.join()
    print datetime.datetime.now().strftime("%Y/%m/%d %H:%M:%S")

if __name__ == '__main__':
    main()

I made some improvements on top of that and put it to work. But then I hit something quite unexpected: because my version used the file iterator (for line in file), the file seems to be pulled in through an internal read-ahead buffer one block at a time (my guess), so every call to file.tell() gives the wrong answer; it looks like the start of the next buffered block. That means there is no guarantee nothing is skipped between two processes' chunks (most likely something is).

Looking around online, quite a few people have run into the same problem:

http://bugs.python.org/issue4633

http://www.reddit.com/r/learnpython/comments/vcg2y/filetell_returns_the_wrong_value_for_crlf_files/

One reply there puts it plainly: "See the documentation for file.next (http://docs.python.org/library/stdtypes.html#file.next). As you can see, file.next uses a buffer which will mess with the result of other methods, such as file.tell."
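The behaviour is easy to reproduce. Here is a minimal sketch (demo.log is a throwaway file made up for the illustration):

# Python 2: the file iterator fills an internal read-ahead buffer, so
# tell() reports the end of that buffer, not the end of the returned line.
with open('demo.log', 'w') as f:   # throwaway test file
    for i in range(1000):
        f.write('line %d\n' % i)

f = open('demo.log', 'r')
first = f.next()             # iterator protocol: fills the buffer
print len(first), f.tell()   # prints e.g. "7 8192" instead of "7 7"
f.close()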

Following the example and some people's suggestions I went back to readline(), but reading the whole file (over ten gigabytes) that way was painfully slow. In the end I decided to count the bytes I read myself, and use that count to decide whether each process had finished its share. Silly, but it works, and it turned out the performance impact was limited...
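A minimal sketch of that counting idea (my reconstruction, not the exact code that went into production; startposition and endposition are the chunk bounds claimed as in the listing above, and it assumes Unix line endings so that len(line) equals the number of bytes consumed):

# Keep the fast iterator, but track the file position by counting the
# bytes of every line instead of trusting fstream.tell().
fstream = open(file, 'r')
fstream.seek(startposition)
if startposition != 0:
    # skip the line cut by the block boundary, but count its bytes
    # so the running position stays accurate
    startposition += len(fstream.readline())
pos = startposition
for line in fstream:
    # process the line here
    pos += len(line)          # manual bookkeeping, immune to the buffer
    if pos >= endposition:    # this process has covered its chunk
        break
fstream.close()

Note that for CRLF files read in text mode on Windows the byte count would drift (which is what the reddit thread above is about), so there the file would have to be opened in binary mode.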

