【In Depth】Optimizing Large-Data Processing and dict Traversal in Python

Source: Internet  Editor: 程序博客网  Time: 2024/06/08 18:13

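Preface: in this example, the requirement is a script that runs at 00:00 every day. The script pulls data from a database that is updated in real time.

Each day it produces one Excel sheet showing the differences between the data at today's midnight and the data at yesterday's midnight.

The requirement itself is simple; the real problem is data volume. Every query returns more than 200,000 rows.

That volume causes two problems:

1. The data does not fit into an Excel sheet. We write Excel files from Python with xlwt, which produces the legacy .xls format, and a .xls worksheet holds at most 65,536 rows. Writing more raises an error.

2. Traversal and filtering. We take the data for the two days, compare them, and generate a difference table. That means traversing and comparing both data sets, and with this much data the traversal takes far too long.

We discuss these two key problems in turn.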

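【Problem 1: Switch from an Excel sheet to a CSV file】

A CSV file is plain text and has no row limit of its own, so all 200,000+ rows can be written to it directly.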
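As a rough illustration of the point above, here is a minimal sketch of writing an arbitrary number of rows with the standard csv module. It uses Python 3 syntax (the article's own code is Python 2), and the helper names write_rows_to_csv and count_csv_rows are made up for this example.

```python
import csv
import os
import tempfile


def write_rows_to_csv(path, rows):
    """Write an iterable of rows to a CSV file and return how many were written."""
    count = 0
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        for row in rows:
            writer.writerow(row)
            count += 1
    return count


def count_csv_rows(path):
    """Count the rows in a CSV file by reading it back."""
    with open(path, newline="", encoding="utf-8") as f:
        return sum(1 for _ in csv.reader(f))


# Synthetic stand-in for the database rows (made-up ids and balances).
rows = ((member_id, member_id / 100.0) for member_id in range(200000))
path = os.path.join(tempfile.mkdtemp(), "balance_data.csv")
written = write_rows_to_csv(path, rows)
```

Because the rows are streamed through a generator, the whole data set never has to exist as one list in memory.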

【Problem 2: Traversing dict data】

Going by past habit, we first wrote the code that builds the comparison dict like this:

def getCurrentCompareMessageDict0(dict0, dict1):
    '''Unoptimized version: build the current comparison dict'''
    dlist0=list(dict0.keys())
    dlist1=list(dict1.keys())
    dict2={}
    for i in range(len(dlist1)):
        # "not in dlist0" scans the whole key list on every iteration
        if dlist1[i] not in dlist0:
            key=dlist1[i]
            value=[0, dict1[dlist1[i]]]
            dict2[key]=value
        else:
            if dict1[dlist1[i]]/100.0 != dict0[dlist1[i]]:
                key=dlist1[i]
                value=[dict0[dlist1[i]], dict1[dlist1[i]]]
                dict2[key]=value
    return dict2

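That is, we first build a key list for each of the two dicts.

Then we run a for loop bounded by the length of the key list and filter the data through DICT[KEY] lookups.

This method is extremely slow.

After some investigation we improved it as follows: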

def getCurrentCompareMessageDict(dict0, dict1):
    '''Optimized version: build the current comparison dict'''
    # Python 2 code; requires "import string"
    dict2={}
    for d, x in dict1.items():
        # dict0 was read from CSV, so its keys are strings
        if dict0.has_key(str(d)):
            if x/100.0 != string.atof(dict0[str(d)]):
                key=d
                value=[string.atof(dict0[str(d)]), x]
                dict2[key] = value
        else:
            key=d
            value=[0, x]
            dict2[key]=value
    return dict2

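With this method, comparing and filtering the two sets of 200,000+ rows finished in under one second.

In our tests, the optimized method was roughly 400 times faster!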
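For readers on Python 3, where has_key and string.atof no longer exist, the optimized function might look roughly like the sketch below. The name get_compare_dict and the sample values are ours, not the author's. The input shapes (string keys and string balances read from CSV, integer ids and balances in cents from the database) are inferred from the article's code.

```python
def get_compare_dict(yesterday, today):
    """Return {member_id: [yesterday_balance, today_cents]} for new or changed ids.

    yesterday: dict of str id -> str balance, as read back from the CSV file.
    today: dict of int id -> balance in cents, as returned by the database.
    """
    diff = {}
    for member_id, cents in today.items():
        key = str(member_id)              # CSV keys are strings
        if key in yesterday:              # hash lookup; replaces has_key()
            old = float(yesterday[key])   # replaces string.atof()
            if cents / 100.0 != old:
                diff[member_id] = [old, cents]
        else:
            diff[member_id] = [0, cents]  # id did not exist yesterday
    return diff


# Made-up sample data for illustration only.
sample_yesterday = {"1": "1.5", "2": "2.0"}
sample_today = {1: 150, 2: 250, 3: 300}
sample_diff = get_compare_dict(sample_yesterday, sample_today)
```

As in the original, the second element of each value stays in cents and is divided by 100 only when the report is written.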

What exactly did this method optimize?

First, the dict traversal was changed to

 for d, x in dict1.items():

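Here d is the key and x is the value. It can also be written as

 for (d, x) in dict1.items():

Material we found online claims that the parenthesized form is more efficient below about 200 iterations and the unparenthesized form above 200. (Reference: python两种遍历方式的比较) The parentheses are only grouping syntax around the same tuple target, so we would not expect a real difference.

We did not test this ourselves and went with the unparenthesized form.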
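One way to check the parentheses claim without timing anything is to compare how the interpreter parses the two spellings. The sketch below assumes CPython 3 and its ast module. The variable names are illustrative.

```python
import ast

plain = "for d, x in pairs:\n    pass\n"
parenthesized = "for (d, x) in pairs:\n    pass\n"

# ast.dump() leaves out line and column positions by default,
# so this compares only the structure of the two parse trees.
same_tree = ast.dump(ast.parse(plain)) == ast.dump(ast.parse(parenthesized))
print(same_tree)
```

If the two trees are equal, the compiler receives the same loop in both cases, and any measured speed difference would have to come from noise rather than from the parentheses.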

Second, the check for whether a key exists in the dict was changed to

if dict0.has_key(str(d)):

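has_key returns a boolean, True or False. (has_key exists only in Python 2; Python 3 removed it in favor of the expression key in dict.)

The original check: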

if dlist1[i] not in dlist0:

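Dropped!

Only two steps really changed: the traversal and the membership check. As for which one brought the gain, we assumed both, but the bulk of it almost certainly comes from the membership check. Testing a key against a list scans the list element by element, so 200,000 checks against a 200,000-item list cost on the order of 10^10 comparisons, whereas has_key is a hash lookup that takes roughly constant time per key.

The two steps are not separate in the code; they are linked.

Only with the for d,x in dict.items() traversal can we use d and x directly, without a separate lookup for each value.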
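To make the difference between the two membership checks concrete, the following Python 3 sketch runs the same filter both ways. The function names and the synthetic data are hypothetical, and the timing helper is only a rough way to observe the gap on your own machine.

```python
import timeit


def new_keys_via_list(old, new):
    """Keys of `new` missing from `old`, testing membership against a list."""
    old_keys = list(old.keys())
    return [k for k in new if k not in old_keys]   # each test scans the list


def new_keys_via_dict(old, new):
    """Same result, testing membership against the dict itself."""
    return [k for k in new if k not in old]        # each test is one hash lookup


def compare_timings(n):
    """Return (list_seconds, dict_seconds) for n synthetic keys."""
    old = {i: i for i in range(n)}
    new = {i: i for i in range(n // 2, n + n // 2)}
    t_list = timeit.timeit(lambda: new_keys_via_list(old, new), number=1)
    t_dict = timeit.timeit(lambda: new_keys_via_dict(old, new), number=1)
    return t_list, t_dict
```

Calling compare_timings(2000) and then compare_timings(20000) should show the list version slowing down much faster than the dict version as n grows, which is the effect the article ran into at 200,000 rows.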

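Those are the two key problems. A few other issues came up along the way and are worth describing:

1. Checking whether two Python lists contain exactly the same elements.

>>> a = [(1,1),(2,2),(3,3),(4,4)]
>>> b = [(4,4),(1,1),(2,2),(3,3)]
>>> a.sort()
>>> b.sort()
>>> a==b
True

That is, sort first and then compare. This checks only whether the elements match, regardless of their order.

Reference: python比较两个数组中的元素是否完全相等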
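The sort-then-compare idea can also be written without modifying the original lists. The sketch below (Python 3) shows two variants. Using collections.Counter is our addition, not something the article mentions.

```python
from collections import Counter

a = [(1, 1), (2, 2), (3, 3), (4, 4)]
b = [(4, 4), (1, 1), (2, 2), (3, 3)]

# sorted() returns new lists, so a and b keep their original order.
same_by_sorting = sorted(a) == sorted(b)

# Counter compares how often each element occurs; elements must be
# hashable but do not need to be orderable.
same_by_counting = Counter(a) == Counter(b)
```

Both variants treat duplicates correctly: [1, 1, 2] and [1, 2, 2] compare as different, which a plain set comparison would miss.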

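2. How do you convert a string to a number in Python?

In the final code we used

string.atof(float string)

string.atoi(integer string)

Both functions exist only in Python 2. The built-ins float() and int() do the same job and work in Python 2 and Python 3 alike.

Note: this requires

import string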
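A short sketch of the same conversions with the built-ins, in Python 3. The helper parse_balance is a made-up name, added only to show one way of handling a field that is not numeric.

```python
def parse_balance(text):
    """Convert a CSV field to float, returning None when it is not numeric."""
    try:
        return float(text)
    except ValueError:
        return None


amount = float("12.50")   # counterpart of string.atof("12.50")
count = int("42")         # counterpart of string.atoi("42")
bad = parse_balance("n/a")
```

Returning None is just one possible policy. Depending on the report, raising the error or logging the row may be more appropriate.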

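3. Reading a CSV file

Until now we had only written CSV files. Here we need to read one and load its data into a dict for convenient use.

The method is:

def getHandleDataDict(fileName):
    '''Load yesterday's midnight data into a dict'''
    dict={}
    csvfile=file(fileName, 'rb')   # file() is the Python 2 built-in
    reader=csv.reader(csvfile)
    for i in reader:
        key=i[0]
        value=i[1]
        dict[key]=value
    csvfile.close()                # release the file handle when done
    return dict

The key lines are:

    csvfile=file(fileName, 'rb')
    reader=csv.reader(csvfile)
    for i in reader:

Each i is one row of the file, which becomes one entry in the dict. It is a list: i[0] is the key and i[1] is the value. Both come back as strings.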
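In Python 3 the same read could be sketched as follows. The function name load_balance_dict and the sample file are hypothetical. Skipping short rows is our own precaution and is not part of the original code.

```python
import csv
import os
import tempfile


def load_balance_dict(path):
    """Read a two-column CSV file into a dict of str id -> str balance."""
    balances = {}
    # The with block closes the file even if a row raises an error.
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f):
            if len(row) < 2:
                continue          # skip blank or malformed lines
            balances[row[0]] = row[1]
    return balances


# Made-up sample file for illustration.
sample_path = os.path.join(tempfile.mkdtemp(), "yesterday.csv")
with open(sample_path, "w", newline="", encoding="utf-8") as f:
    f.write("1001,12.5\r\n\r\n1002,0.3\r\n")
sample_balances = load_balance_dict(sample_path)
```

Note that both keys and values are strings here, which matters for the lookup problem described in the KeyError item of this article.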

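4. Python's KeyError

This is not the first time we have hit this error, so we call it out here to underline its importance.

KeyError means the dict does not contain the key. In that case, reading the value with dict[key] raises KeyError.

The key may have the wrong data type, or it may simply not exist. Both cases need to be considered.

In this example we hit the wrong-type case, which is why item 2 above, converting strings to numbers, came up at all.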
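The wrong-type case is easy to reproduce. In the Python 3 sketch below the ids and balances are made up. The point is that the integer 1001 and the string "1001" are different keys.

```python
yesterday = {"1001": "12.5"}   # keys read from CSV are strings
member_id = 1001               # ids returned by the database are integers

try:
    yesterday[member_id]
    found = True
except KeyError:
    found = False              # 1001 (int) is not the same key as "1001" (str)

# Converting the key to the stored type makes the lookup succeed.
value = yesterday.get(str(member_id))

# dict.get() with a default avoids KeyError for keys that really are absent.
missing = yesterday.get("9999", "0")
```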

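【Thoughts on how to write the script】

There is also the matter of how to approach writing a script. First, here is the final version of the code.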

#!/usr/bin/python
# -*- coding: UTF-8 -*-
__author__ = "$Author: wangxin.xie$"
__version__ = "$Revision: 1.0 $"
__date__ = "$Date: 2015-01-05 10:01$"
###############################################################
# Purpose: comparison table of balances at today's 00:00 and
#          yesterday's 00:00; runs every day at 00:00
###############################################################
import sys
import datetime
import xlwt
import csv
import string
from myyutil.DBUtil import DBUtil
#######################Global variables###########################
memberDBUtil = DBUtil('moyoyo_member')
today = datetime.datetime.today()
todayStr = datetime.datetime.strftime(today, "%Y-%m-%d")
handleDate = today - datetime.timedelta(1)
handleDateStr = datetime.datetime.strftime(handleDate, "%Y-%m-%d")
fileDir = 'D://'
handleCsvFileName= fileDir+handleDateStr+'_balance_data.csv'
currentCsvfileName = fileDir+todayStr+'_balance_data.csv'
currentexcelFileName= fileDir+todayStr+'_balance_compare_message.xls'
style1 = xlwt.XFStyle()
font1 = xlwt.Font()
font1.height = 220
font1.name = 'SimSun'
style1.font = font1
csvfile1=file(currentCsvfileName, 'wb')
writer1 = csv.writer(csvfile1, dialect='excel')
##################################################################

def genCurrentBalanceData():
    '''Fetch the current balance data'''
    sql = '''
        SELECT MEMBER_ID,
        (TEMP_BALANCE_AMOUNT + TEMP_FROZEN_AMOUNT)
        FROM moyoyo_member.MONEY_INFO
        WHERE (TEMP_BALANCE_AMOUNT + TEMP_FROZEN_AMOUNT) != 0
    '''
    rs = memberDBUtil.queryList(sql, ())
    # Return an empty list rather than None so callers can still iterate
    if not rs: return []
    return rs

def getCurrentDataDict(rs):
    '''Assemble the current data into a dict'''
    dict={}
    for i in range(len(rs)):
        key=rs[i][0]
        value=rs[i][1]
        dict[key]=value
    return dict

def writeCsv(x,writer):
    '''Write one row of CSV data'''
    writer.writerow(x)

def writeCurrentCsvFile():
    '''Write the CSV file containing the current data'''
    rs=genCurrentBalanceData()
    dict=getCurrentDataDict(rs)
    for d, x in dict.items():
        writeCsv([d, x/100.0], writer1)
    csvfile1.close()
    return dict

def getHandleDataDict(fileName):
    '''Load yesterday's midnight data into a dict'''
    dict={}
    csvfile=file(fileName, 'rb')
    reader=csv.reader(csvfile)
    for i in reader:
        key=i[0]
        value=i[1]
        dict[key]=value
    csvfile.close()
    return dict

def getCurrentCompareMessageDict(dict0, dict1):
    '''Build the current comparison dict'''
    dict2={}
    for d, x in dict1.items():
        if dict0.has_key(str(d)):
            if x/100.0 != string.atof(dict0[str(d)]):
                key=d
                value=[string.atof(dict0[str(d)]), x]
                dict2[key] = value
        else:
            key=d
            value=[0, x]
            dict2[key]=value
    return dict2

def writeExcelHeader():
    '''Write the Excel header row'''
    wb = xlwt.Workbook(encoding = "UTF-8", style_compression = True)
    # Sheet name: "balance comparison list"
    sht0 = wb.add_sheet("余额信息对比列表", cell_overwrite_ok = True)
    sht0.col(0).width=3000
    sht0.col(1).width=4000
    sht0.col(2).width=4000
    num=today.day
    # Column titles: user ID, balance at 00:00 on day N-1, balance at 00:00 on day N
    sht0.write(0, 0, '用户ID', style1)
    # Use yesterday's date; num-1 would give 0 on the first day of a month
    sht0.write(0, 1, str(handleDate.day)+'日零点余额', style1)
    sht0.write(0, 2, str(num)+'日零点余额', style1)
    return wb

def writeCurrentCompareMessageInfo(sht,dict):
    '''Write the comparison rows'''
    dlist=list(dict.keys())
    for i in range(len(dlist)):
        sht.write(i+1, 0, dlist[i], style1)
        sht.write(i+1, 1, dict[dlist[i]][0], style1)
        sht.write(i+1, 2, dict[dlist[i]][1]/100.0, style1)

def writeCurrentCompareMessageExcel(dict):
    '''Write the comparison Excel file'''
    wb = writeExcelHeader()
    sheet0 = wb.get_sheet(0)
    writeCurrentCompareMessageInfo(sheet0, dict)
    wb.save(currentexcelFileName)

def main():
    print "===%s start===%s"%(sys.argv[0], datetime.datetime.strftime(datetime.datetime.now(), "%Y-%m-%d %H:%M:%S"))
    # Query the database first so the snapshot is as close to 00:00 as possible
    currentDataDict=writeCurrentCsvFile()
    handleDataDict = getHandleDataDict(handleCsvFileName)
    currentCompareMessageDict = getCurrentCompareMessageDict(handleDataDict, currentDataDict)
    writeCurrentCompareMessageExcel(currentCompareMessageDict)
    print "===%s end===%s"%(sys.argv[0], datetime.datetime.strftime(datetime.datetime.now(), "%Y-%m-%d %H:%M:%S"))

if __name__ == '__main__':
    try:
        main()
    finally:
        if memberDBUtil: memberDBUtil.close()

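The reason we bring up the approach to script writing at all is that, while writing this script, we neglected many points that deserved attention.

The flow in particular: what to do first and what to do next, how to process the data once we have it, and whether any steps can be skipped.

All of these need attention when writing each function.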

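Idea 1: Let the script's run time guide its structure

The requirement says the script must fetch the data at midnight every day, and the data in the database changes in real time.

Since the data must be fetched at 00:00, the function that fetches it has to come first.

In other words, the order of the functions in a script is closely tied to the time at which the script is required to run.

Why the script runs at 00:00, and what it does at that moment, deserve careful thought,

because what is ultimately affected is the accuracy of the data.

Suppose we ran something else first, such as reading yesterday's midnight CSV file.

If that took 20-odd seconds before the fetch ran, the data fetched would no longer be the midnight data.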
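The ordering rule can be sketched as a small driver that takes the three steps as functions and always runs the time-sensitive one first. This is a Python 3 illustration with hypothetical names (run, fetch_current, load_previous, compare), not the author's main().

```python
import datetime


def run(fetch_current, load_previous, compare):
    """Run the three steps with the time-sensitive query first."""
    snapshot_time = datetime.datetime.now()
    current = fetch_current()      # time-sensitive: query the live database first
    previous = load_previous()     # slow file I/O can safely happen afterwards
    return snapshot_time, compare(previous, current)
```

Recording snapshot_time next to the query also makes it possible to check afterwards how far from 00:00 the data was actually taken.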

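Idea 2: Do not repeat work.

Let us look at how the data flows in this example.

Dict0--------yesterday's midnight data, stored in a CSV file.

Dict1--------the data fetched from the database when the script runs at midnight today. It is written to a CSV file first.

Dict2--------the dict of differences, obtained by comparing yesterday's data with the data just fetched.

What needs attention is how the code builds Dict2. Dict0 is simply read from its file. Dict1 already exists in the code as a dict and can be returned directly.

Earlier, however, we made a mistake: we loaded Dict1 back from the CSV file we had just generated.

That is unnecessary. The data is already in memory, so there is no need to read it from a file again.

Doing so lengthens the script's run time for no reason. It is a basic lapse in logic.

Hence this function in the final version of the code: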

def writeCurrentCsvFile():
    '''Write the CSV file containing the current data'''
    rs=genCurrentBalanceData()
    dict=getCurrentDataDict(rs)
    for d, x in dict.items():
        writeCsv([d, x/100.0], writer1)
    csvfile1.close()
    return dict

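After the CSV file is written, the dict that was used is returned directly, because it is needed again later.

The generated CSV file exists only so that it can be compared with tomorrow's data.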
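A Python 3 sketch of the same write-and-return pattern is shown below. The name write_snapshot, the sample rows, and the temporary path are made up for illustration.

```python
import csv
import os
import tempfile


def write_snapshot(path, rows):
    """Build the id -> cents dict, write it to CSV for tomorrow, and return it."""
    current = {member_id: cents for member_id, cents in rows}
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        for member_id, cents in current.items():
            writer.writerow([member_id, cents / 100.0])
    return current   # reuse the in-memory dict instead of re-reading the file


# Made-up rows standing in for the database result.
snapshot_path = os.path.join(tempfile.mkdtemp(), "today_balance_data.csv")
snapshot = write_snapshot(snapshot_path, [(1, 150), (2, 25)])
```

The caller gets the dict for today's comparison, and the file stays on disk untouched until tomorrow's run reads it.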

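Idea 3: Know why each piece of data exists.

Having made the mistake above, it is worth reflecting on what each piece of data is for, and what each file is for.

Why do we build a given dict? Keep in mind that one piece of data may serve more than one purpose.

csv0 exists to provide dict0, and dict0 exists to be compared with today's data.

dict1 exists to generate the CSV for tomorrow and to generate today's dict2.

In other words, csv1 does not exist for the sake of dict1 at all. It is only preparation for tomorrow.

Once that is clear, nobody would do something as silly as loading dict1 from csv1.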

0 0