Python scripting: batch-comparing text files and finding differences by specific fields

Source: Internet  Editor: 程序博客网  Date: 2024/06/06 02:28

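Sometimes you run into this requirement: you have two files containing table-like data, whose fields partly overlap and partly differ, and you need to compare them and extract the records that differ.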

File samples

Format of each record in file A:

03090000   00049993   9222100502392220106000000020000029000170124500019054                 20170124 12:30:01622908347435512917       00049996   

Format of each record in file B:

01006530    00096900    000480 0124174505 6228480478369552177 000000004066 000000000000  00000000000 0200 000000 5411 00000021 100504754110404 003081009289 00 000000 01030000    000000 00 071 000000000005 000000000000 D00000000001 1 000 6 0 0124174510 01030000    0 03     00000000000  00010111001   

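File A contains a number of records, and so does file B. Some records in B carry an index number that does not appear anywhere in A, and those are the records we need to find. For example, the field 0124174510 corresponds to the last 12 characters of the field 9222100502392220106000000020000029000170124500019054 in A. In the code, the key on the B side is read from the 14th whitespace-separated field (003081009289 in the sample line above) and the merchant ID from the 13th; on the A side both come from slices of the third field. The missing records are found by splitting and slicing the strings and matching the keys in batch.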
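As a small illustration of that slicing, the sketch below pulls the merchant ID and the index key out of the two sample records shown above. The function names `a_record_keys` and `b_record_keys` are made up for this example, the field positions are the ones used by the script in this post, and the sample lines have their runs of spaces collapsed (which does not affect `split()`). Python 3 is assumed.

```python
# Sample records copied from the article, with whitespace collapsed.
a_line = ("03090000 00049993 "
          "9222100502392220106000000020000029000170124500019054 "
          "20170124 12:30:01622908347435512917 00049996")
b_line = ("01006530 00096900 000480 0124174505 6228480478369552177 "
          "000000004066 000000000000 00000000000 0200 000000 5411 00000021 "
          "100504754110404 003081009289 00 000000 01030000 000000 00 071 "
          "000000000005 000000000000 D00000000001 1 000 6 0 0124174510 "
          "01030000 0 03 00000000000 00010111001")


def a_record_keys(line):
    """Return (merchant_id, index_key) for a record from file A."""
    combined = line.split()[2]           # third whitespace-separated field
    return combined[4:19], combined[-12:]


def b_record_keys(line):
    """Return (merchant_id, index_key) for a record from file B."""
    fields = line.split()
    return fields[12], fields[13]        # 13th and 14th fields


a_merchant, a_index = a_record_keys(a_line)
b_merchant, b_index = b_record_keys(b_line)
print(a_merchant, a_index)
print(b_merchant, b_index)
```

Note that the two sample records belong to different merchants, so with the merchant filter used in the script this particular B record would be skipped rather than compared.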

Code

# dates to be compared
dateArr = ["170124", "170125", "170130", "170206", "170211", "170228",
           "170304", "170309", "170314", "170321", "170325"]

# local paths: source data and comparison results
src_dir = "./src_data"
res_dir = "./res_data"

# the exact merchant ID we care about
gMchtId = "100502392220106"

# read files and compare, then write the differing records
# (print() with a single argument works on both Python 2 and Python 3)
print("start to compare file...")
for dateStr in dateArr:
    print("comparing " + dateStr + " files")
    mic_file_name = "M_IC" + dateStr + "OTRAD100502392220106"
    acom_file_name = "no_chongzhengIND" + dateStr + "01ACOM"

    # set of index keys found in the mic file for this date
    micIndexSet = set()

    # read mic file and create index keys
    print("reading " + dateStr + " mic file")
    with open(src_dir + '/' + mic_file_name, 'r') as micFileStream:
        # process file line by line
        for micLineStr in micFileStream:
            # skip blank lines (a line read from a file still carries its
            # newline, so test the stripped text rather than len() == 0)
            if not micLineStr.strip():
                print("empty mic line")
                continue
            # slice strings
            micLineDataArray = micLineStr.split()
            combinedInfo = micLineDataArray[2]
            micMchtId = combinedInfo[4:19]
            # skip other merchants' records
            if micMchtId != gMchtId:
                continue
            # the query index is the last 12 characters
            micIndex = combinedInfo[-12:]
            micIndexSet.add(micIndex)

    # list that collects the result lines
    resultLineStr = list()

    # read acom file and compare index keys
    print("reading " + dateStr + " acom file")
    with open(src_dir + '/' + acom_file_name, 'r') as acomFileStream:
        # process file line by line
        for acomLineStr in acomFileStream:
            if not acomLineStr.strip():
                print("empty acom line")
                continue
            acomLineDataArray = acomLineStr.split()
            acomMchtId = acomLineDataArray[12]
            if acomMchtId != gMchtId:
                continue
            acomIndex = acomLineDataArray[13]
            # keep the lines whose index is missing from the mic file
            if acomIndex not in micIndexSet:
                resultLineStr.append(acomLineStr)

    # write the result lines to file; each line already ends with a
    # newline, so none is appended here
    print("write " + dateStr + " result file")
    with open(res_dir + '/' + dateStr + "_result", 'w') as resultFileStream:
        resultFileStream.writelines(resultLineStr)

print("compare over")
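The comparison in the script above can also be written as a function that works on lists of lines, which makes it easy to try without the real data files. This is only a sketch. The name `find_missing_records`, the helper `make_b_line`, and all the records below are made up for illustration. The field positions follow the script. Unlike the script, lines with too few fields are skipped instead of raising an error. Python 3 is assumed.

```python
def find_missing_records(a_lines, b_lines, merchant_id):
    """Return the B lines for merchant_id whose index key is absent from A."""
    a_index_set = set()
    for line in a_lines:
        fields = line.split()
        if len(fields) < 3:              # blank or malformed line
            continue
        combined = fields[2]
        if combined[4:19] == merchant_id:
            a_index_set.add(combined[-12:])

    missing = []
    for line in b_lines:
        fields = line.split()
        if len(fields) < 14:             # blank or malformed line
            continue
        if fields[12] == merchant_id and fields[13] not in a_index_set:
            missing.append(line)
    return missing


# Made-up records that only mimic the layout used in the article.
MERCHANT = "100502392220106"


def make_b_line(merchant, index):
    # 12 filler fields, then merchant ID (13th) and index key (14th).
    return " ".join(["0"] * 12 + [merchant, index, "00"]) + "\n"


a_lines = ["03090000 00049993 9222" + MERCHANT + "0" * 21
           + "124500019054 20170124\n"]
b_lines = [
    make_b_line(MERCHANT, "124500019054"),           # present in A
    make_b_line(MERCHANT, "999999999999"),           # missing from A
    make_b_line("100504754110404", "888888888888"),  # other merchant
    "\n",                                            # blank line
]

missing = find_missing_records(a_lines, b_lines, MERCHANT)
print(missing)
```

Building a set from file A first keeps each lookup for a B record at roughly constant time, so the whole comparison is a single pass over each file.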

Screenshot

[image]
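The file names are assembled in batch from the dates of the files in the folder, and the results are written to a separate folder. Python handles this reasonably quickly.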
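The script hardcodes its list of dates. As a possible variation, the dates could be read from the file names already present in the source folder. The sketch below assumes the two naming patterns used in the script ("M_IC" + date + "OTRAD" + merchant ID, and "no_chongzhengIND" + date + "01ACOM"). The function name `discover_dates` is hypothetical and not part of the original code. Python 3 is assumed.

```python
import os
import re

# "M_IC", six-digit date, "OTRAD", 15-digit merchant ID
MIC_PATTERN = re.compile(r"^M_IC(\d{6})OTRAD(\d{15})$")


def discover_dates(src_dir, merchant_id):
    """Return sorted dates for which both the mic and the acom file exist."""
    dates = []
    for name in os.listdir(src_dir):
        match = MIC_PATTERN.match(name)
        if not match or match.group(2) != merchant_id:
            continue
        date_str = match.group(1)
        acom_name = "no_chongzhengIND" + date_str + "01ACOM"
        # Only keep dates that have a matching acom file to compare against.
        if os.path.isfile(os.path.join(src_dir, acom_name)):
            dates.append(date_str)
    return sorted(dates)
```

With this in place, `dateArr = discover_dates("./src_data", "100502392220106")` could stand in for the hardcoded list, provided the folder follows those naming patterns exactly.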
