[Full script] Fetching data from a web page and writing it to Excel (API mode), first draft

Source: Internet | Published by: 国家工商局网络教学 | Editor: 程序博客网 | Date: 2024/06/05 06:27

Setup:

1. Problems you may run into when reading or writing Excel (based on Python 2.7):

import xlwt
ImportError: No module named xlwt

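Python reports that the xlwt (or xlrd) module is missing. The fix is to download it yourself (once downloaded, unpack it into the Python directory).

Then install it yourself. { # this is just one of the steps

Win + R

cmd

cd C:\Python27\xlwt-1.0.0

python setup.py install # this simply runs the setup.py file in that folder

}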
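Before running the script, it can help to check which of the Excel modules are actually importable. The following is a minimal sketch using only the standard library; `missing_modules` is a hypothetical helper name and is not part of xlwt or xlrd.

```python
import importlib

def missing_modules(names):
    """Return the names from `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing

# Lists whichever of the two Excel modules still needs installing.
print(missing_modules(['xlrd', 'xlwt']))
```

An empty list means both modules are installed and the ImportError above should not appear.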

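Intended program structure:

Although this is a small program, I split it into the following layers to keep my own thinking organized.

Getting the Excel info:

Because the n sheets are traversed by consecutive sheet index, it is worth checking each sheet's name while looping: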

print 'The current sheet name is: %s, and sheet num is %d' % (excelfile.sheet_names()[sheetnum],sheetnum)

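excelfile.sheet_names()[sheetnum] returns the sheet name.

Only the issue id is needed, and the issue id must be numeric. Note: a value i read from an Excel cell that holds a pure number has type float, whereas a cell whose value contains letters has type unicode, so the method below cannot be used for it.

for i in issuekeyset:
    if isinstance(i, float):
        self.__issuekey.append( int(i) )

(i is converted to int because the search keyword used later to fetch information from the web has to be an int.)

isinstance() is a built-in function, so the official Python documentation covers its usage. For a check as simple as whether a value is a float instance, type() would also work.

Traversing the required column of every sheet and appending to the corresponding list inevitably produces duplicates, so they have to be removed: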

issuekeyset = set(self.__issuekey)  #duplicate removal
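The filtering and deduplication steps above can be combined in one small function. This is only a sketch: `numeric_issue_ids` is a hypothetical name, and the input is assumed to be a plain list of cell values such as a column read from a sheet.

```python
def numeric_issue_ids(cells):
    """Keep float cells only, convert them to int, and drop duplicates."""
    ids = set()
    for value in cells:
        if isinstance(value, float):   # numeric Excel cells arrive as float
            ids.add(int(value))
    return sorted(ids)

print(numeric_issue_ids([1001.0, u'ABC-7', 1002.0, 1001.0]))  # prints [1001, 1002]
```

Sorting the result is optional; it just makes the output order predictable, which a bare set does not guarantee.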

Getting the web info:

This mainly relies on the API that JIRA provides to fetch the data for each issue.
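As a rough illustration of what the API side involves, the sketch below builds the URL for one issue and pulls the comment fields out of a JSON response. The server address `JIRA_BASE` and both function names are hypothetical; the path follows the JIRA REST API v2 issue resource, and the field names are the ones this script reads from the response. The HTTP request itself is left out.

```python
import json

JIRA_BASE = 'https://jira.example.com'  # hypothetical server address

def issue_url(issue_id, base=JIRA_BASE):
    """Build the REST URL for one issue (JIRA REST API v2 path)."""
    return '%s/rest/api/2/issue/%s' % (base.rstrip('/'), issue_id)

def comment_summaries(payload):
    """Extract (author, updated, body) tuples from an issue's JSON text."""
    data = json.loads(payload)
    comments = data['fields']['comment']['comments']
    return [(c['updateAuthor']['displayName'], c['updated'], c['body'])
            for c in comments]
```

Keeping the parsing separate from the request makes it possible to test the parsing against a saved response.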

Logic processing

This step turns the fetched information into the form the Excel report expects. It mainly covers:

- Time format (see "Issue Arise & Solution during scripting" below):

if ( td.AmPm(createtime.hour) == td.AmPm(now.hour) ) and ( createtime.date() == now.date() ):
    crashdetail[4] = 'new'
elif ( td.AmPm(updatetime.hour) == td.AmPm(now.hour) ) and ( updatetime.date() == now.date() ):
    print 'time is ',now.hour, updatetime.hour
    interval = now.hour - updatetime.hour
    #if interval == 1:     !!!!careful here, interval could be 0
    if interval == 1 or interval == 0:
        crashdetail[4] = '1 hour'
    else:
        crashdetail[4] = str(interval) + ' hours'

* In conditional statements, order the conditions from specific to general!

if (createtime.date() == now.date()) and ( td.AmPm(createtime.hour) == td.AmPm( now.hour ) ):
    print '********This is new bug, bug date is*******', createtime.date(),now.date()
    bugdetail[4] = 'new'
else:
    # calculate time differences
    interval = (now.date() - updatetime.date()).days
    if interval == 1:
        bugdetail[4] = '1 day'
    elif interval == 0:
        if now.hour - updatetime.hour == 1:
            bugdetail[4] = '1 hour'
        else:
            bugdetail[4] = str(now.hour - updatetime.hour)+' hours'
    else:
        bugdetail[4] = str(interval) + ' days'

* In conditional statements, order the conditions from specific to general, ruling cases out one at a time.
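The "specific to general" ordering can be seen more easily when the labelling is pulled into one function. This is a sketch under assumptions: `am_pm` is a hypothetical stand-in for the script's td.AmPm helper (whose real implementation is not shown), and a same-day difference of 0 hours is reported as '1 hour', as in the crash branch.

```python
from datetime import datetime

def am_pm(hour):
    """Hypothetical stand-in for td.AmPm: which half of the day an hour is in."""
    return 'AM' if hour < 12 else 'PM'

def age_label(created, updated, now):
    # Most specific case first: created today, in the current half-day.
    if created.date() == now.date() and am_pm(created.hour) == am_pm(now.hour):
        return 'new'
    days = (now.date() - updated.date()).days
    if days == 0:
        hours = now.hour - updated.hour
        # A difference of 0 or 1 is reported as '1 hour'.
        return '1 hour' if hours <= 1 else '%d hours' % hours
    # General case last: whole days since the last update.
    return '1 day' if days == 1 else '%d days' % days

now = datetime(2024, 6, 5, 15, 0)
print(age_label(datetime(2024, 6, 1, 9, 0), datetime(2024, 6, 5, 10, 0), now))  # prints 5 hours
```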

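- Issues have to be sorted by priority, and within each priority the new issues have to sit at the top:

This is a sorting problem. Fortunately Python's sort is stable, meaning it keeps the existing order of items that compare equal, so my approach is to move all the 'new' items to the top first and then sort by priority with sort.

Moving 'new' to the top follows the idea of selection sort:

def sortNew(self,issuedetailist):
    newitems = []   # renamed from 'sorted' so the built-in sorted() is not shadowed
    for i in range(len(issuedetailist)):
        newindex = self.findNew(issuedetailist)
        newitems.append(issuedetailist[newindex])
        issuedetailist.pop(newindex)
    issuedetailist = newitems + issuedetailist
    print issuedetailist
    return issuedetailist

It is also possible to sort without sort(). There are many options, such as implementing the sort by hand, or sorting by priority first and then moving 'new' to the top within each priority. Here I picked the simplest one to save time.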
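Because the sort is stable, the same result can also be reached with two passes of sorted(). The sketch below is an alternative to the selection-sort approach, not the script's own code. The priority names in `PRIORITY_ORDER` and the position of the priority field (`priority_index`) are assumptions; index 4 for the age label matches the snippets above.

```python
PRIORITY_ORDER = {'Blocker': 0, 'Critical': 1, 'Major': 2, 'Minor': 3}  # hypothetical names

def sort_issues(issues, priority_index=1, age_index=4):
    """Sort rows by priority, with 'new' rows first inside each priority."""
    # Pass 1: 'new' rows first (False sorts before True); other order is kept.
    new_first = sorted(issues, key=lambda row: row[age_index] != 'new')
    # Pass 2: stable sort by priority keeps the 'new' rows on top per group.
    return sorted(new_first, key=lambda row: PRIORITY_ORDER[row[priority_index]])

issues = [
    ['1', 'Minor', '', '', 'new'],
    ['2', 'Blocker', '', '', '2 days'],
    ['3', 'Minor', '', '', '1 day'],
    ['4', 'Blocker', '', '', 'new'],
]
print([row[0] for row in sort_issues(issues)])  # prints ['4', '2', '1', '3']
```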

- Counting how many times an issue occurred:

A comment counts only if it was written by a tester on our team and falls within the same half-day:

if (updatetimectu.date() == now.date() )and ( td.AmPm(now.hour) == td.AmPm(updatetimectu.hour) ):
    for tester in testers:
        if tester in cmt.getBody():
            occur += 1
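Wrapped into a function, the counting rule looks roughly like this. It is a sketch only: comments are assumed to be simple (updated_datetime, body) pairs rather than the script's Comment objects, and `half_day` is a hypothetical replacement for the td.AmPm comparison.

```python
from datetime import datetime

def half_day(moment):
    """Hypothetical stand-in for the td.AmPm check: (date, 'AM' or 'PM')."""
    return (moment.date(), 'AM' if moment.hour < 12 else 'PM')

def count_occurrences(comments, testers, now):
    """Count tester mentions in comments from the current half-day."""
    occur = 0
    for updated, body in comments:
        if half_day(updated) == half_day(now):
            # One count per tester name found in the comment body.
            occur += sum(1 for tester in testers if tester in body)
    return occur
```

Comparing the (date, half) pair in one step avoids forgetting either of the two conditions.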

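Writing to Excel

So far I have not found a way to write directly from Python into an Excel file with a complex structure and save it.

The only option is to have Python generate an intermediate csv file and use VBA to write that file into Excel.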
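Producing the intermediate csv file takes only the standard csv module. The sketch below writes to any text stream; the column names are hypothetical, since the real report layout is not shown. It is written for Python 3, and on Python 2.7 the csv module expects a file opened in 'wb' mode instead of a text stream.

```python
import csv
import io

def write_report_csv(rows, stream):
    """Write a header line and the issue rows as csv to `stream`."""
    writer = csv.writer(stream, lineterminator='\n')
    writer.writerow(['id', 'priority', 'age'])  # hypothetical column names
    for row in rows:
        writer.writerow(row)

buffer = io.StringIO()
write_report_csv([[1001, 'Major', 'new'], [1002, 'Minor', '2 days']], buffer)
print(buffer.getvalue())
```

For a real run the stream would be a file in the report folder that the VBA side then reads.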

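To get one-click report generation, a Sub that launches the Python script is added in VBA:

Private Sub OpenEXE()
    Dim path As String
    path = Application.ActiveWorkbook.path & "/AutoReport.exe"
    Shell path, 1
End Sub

This creates a conflict: the open Excel file launches the Python script, and the script in turn tries to open that same Excel file, which raises an error. The way around it is to create a copy and have the Python script open the copy (read it, then burn it):

import os
import shutil
import xlrd

shutil.copyfile(src_excel_file,dst_excel_file)
excelfile = xlrd.open_workbook(dst_excel_file)   # open the Excel copy
os.remove(dst_excel_file) # burn after reading: close first!!!! then delete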
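The copy, read, delete sequence can be made safer with try/finally, so the copy is removed even when reading fails. This is a sketch, not the script's code: `read_via_copy` is a hypothetical helper, and `reader` is any function that takes a path and is finished with the file by the time it returns.

```python
import os
import shutil
import tempfile

def read_via_copy(src_path, reader):
    """Copy src_path to a temp file, run reader(copy_path), then delete the copy."""
    handle, copy_path = tempfile.mkstemp(suffix=os.path.splitext(src_path)[1])
    os.close(handle)  # only the path is needed; copyfile reopens it
    try:
        shutil.copyfile(src_path, copy_path)
        return reader(copy_path)
    finally:
        os.remove(copy_path)  # runs whether or not the reader raised
```

With xlrd installed, a call could look like `read_via_copy('xxx.xlsm', xlrd.open_workbook)`.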

Every file the script touches needs an explicitly defined path!

Paths matter!

The default path is not reliable!

A packaged exe goes wrong very easily here!

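Issue Arise & Solution during scripting:

Q: The details of each issue have been looked up by issue id, but how can they be sorted by issue priority?

- Remember how the official documentation introduces the sort method? The many examples under 'Key Functions' are enough to solve this. (A plain-language Chinese write-up explains it as well.)

When using itemgetter(), be sure to import it first: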

from operator import itemgetter
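A short example of itemgetter() as a sort key. The rows here are made up for illustration, with a numeric priority rank assumed at index 1.

```python
from operator import itemgetter

# Hypothetical rows: (issue id, priority rank, priority name)
rows = [(1003, 2, 'Major'), (1001, 1, 'Critical'), (1002, 2, 'Major')]

# Sort by the value at index 1; rows with equal rank keep their order.
by_priority = sorted(rows, key=itemgetter(1))
print(by_priority)
```

itemgetter(1) behaves like `lambda row: row[1]`, and itemgetter(1, 0) would sort by rank and then by id.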

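Q: Time checks, two problems

1. The issue's time zone differs from the time zone of the computer running the script

* To use datetime, the import has to be written as: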

from datetime import datetime

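(Writing just import datetime and then calling datetime.now() raises: AttributeError: 'module' object has no attribute 'now', because now() belongs to the datetime class inside the module, not to the module itself.)

The time-zone mismatch affects all later time handling, so a time-zone conversion is needed:

First, a rather tedious approach: force the issue time into local time. That means accounting for daylight saving time on the issue side, and after converting, the day may roll over, which can change the month, which can change the year. In short, it is very tedious.
So a different approach is better. The method from my colleague 逗哥 is much simpler: get the current time directly in the issue's time zone, then compare the creation time in that zone against the current time in that zone. Working out the time difference becomes much easier this way. As for whether an issue is new or old, it comes down to observing real life and finding the logical pattern (report time = current time).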
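One way to get "now" in the issue's time zone is to shift the current UTC time by that zone's offset. This is a sketch under a strong assumption: the offset is a fixed number of hours (`ISSUE_UTC_OFFSET_HOURS` is a made-up value), so daylight saving time is not handled.

```python
from datetime import datetime, timedelta

ISSUE_UTC_OFFSET_HOURS = -7  # hypothetical offset of the issue server's zone

def to_issue_zone(utc_time, offset_hours=ISSUE_UTC_OFFSET_HOURS):
    """Shift a naive UTC datetime into the issue tracker's zone."""
    # timedelta arithmetic rolls the day, month and year over automatically.
    return utc_time + timedelta(hours=offset_hours)

print(to_issue_zone(datetime(2024, 6, 5, 3, 0)))  # prints 2024-06-04 20:00:00
```

Once both the issue timestamps and the current time are in the same zone, the date and hour comparisons need no further conversion.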

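Q: Modifying an Excel sheet

At the start the script used xlrd to read the Excel content, but xlrd cannot modify the file. Later I found that xlwt can write Excel, but it can only create a new Excel file and cannot modify an existing one.

Working with Excel from Python is really not that simple. There are many tools that sit between Python and Excel, but each has its own limits and they differ from one another.

The Excel file this script works on has complex formatting and formulas, and that is when I came across one word: win32com

There is one more option, VBA: press Alt+F11 in Excel to open the VBA development environment and let Excel itself read the information captured from the web page into the workbook.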

Q: Every object written into the list ends up with the value of the last object

# -*- coding: utf-8 -*-
cmts = []
cmt = Comment()   # the same object is rewritten over and over!
comment_num = data['fields']['comment']['total']
if comment_num != 0:
    for num in range(comment_num):
        #cmt = Comment()
        cmt.setTester(data['fields']['comment']['comments'][num]['updateAuthor']['displayName'])  # keeps rewriting the same object at the same address! Comments 0-61 are not kept; only the last one survives and overwrites the rest
        cmt.setUpdatetime( data['fields']['comment']['comments'][num]['updated'])
        cmt.setBody(data['fields']['comment']['comments'][num]['body'])
        print cmt.getUpdatetime()
        print cmt.getBody()
        cmts.append(cmt) # every entry refers to the same object! It gets overwritten each time! Appending stores a reference (an address), not a copy!!!
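The effect is easy to reproduce in isolation. In the sketch below, `Comment` is a hypothetical minimal stand-in for the script's class; the only difference between the two functions is where the object is created.

```python
class Comment(object):
    """Hypothetical minimal stand-in for the script's Comment class."""
    def __init__(self):
        self.body = None

def collect_shared(bodies):
    cmt = Comment()          # one object, created once outside the loop
    cmts = []
    for body in bodies:
        cmt.body = body      # overwrites that same object every time
        cmts.append(cmt)     # appends another reference to it
    return [c.body for c in cmts]

def collect_fresh(bodies):
    cmts = []
    for body in bodies:
        cmt = Comment()      # a new object per iteration
        cmt.body = body
        cmts.append(cmt)
    return [c.body for c in cmts]

print(collect_shared(['a', 'b', 'c']))  # prints ['c', 'c', 'c']
print(collect_fresh(['a', 'b', 'c']))   # prints ['a', 'b', 'c']
```

Moving the constructor call inside the loop, as in the commented-out line of the original snippet, is the whole fix.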

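Q: After packaging as an exe, problems everywhere (I used screen-recording software to capture the exe's error message, which only flashes by)

Runtime error: NameError: global name 'exit' is not defined. The program called exit(), which the packaged exe does not recognize, because exit() is added by the site module for interactive use. sys.exit() is available instead (import os, sys beforehand):

def extraIssueKey(self):
    fname = glob.glob(Default_Path + '\\' + 'issue.txt')
    if len(fname) == 0:
        sys.exit()   # unlike the bare exit(), sys.exit() still works in a packaged exe
    else:
        issues_list = open(fname[0]).read().split('\n')   # glob returns a list, so open its first match
        for line in issues_list:
            line = line.rstrip()
            if not line:   # skip blank lines, e.g. after a trailing newline
                continue
            print line,type(line)
            self.__issuekey.append( int(line) )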
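Reading the issue file can also be written so that a missing file is an ordinary case rather than a reason to exit. This is a sketch with a hypothetical function name; it returns an empty list when the file does not exist and ignores lines that are not plain digits.

```python
import os

def read_issue_keys(path):
    """Return the integer issue ids listed one per line in `path`."""
    if not os.path.isfile(path):
        return []              # a missing file simply means no extra issues
    keys = []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.isdigit():  # skips blank lines and non-numeric entries
                keys.append(int(line))
    return keys
```

The caller can then decide whether an empty result should stop the run or just be skipped.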

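Path not found: the starting path of the packaged exe changes (it runs from a subdirectory created by the packaging; put a print statement in the script, package it, and run the exe to see the current path). It does not start from the current folder, so always spell out the path in the program and never use a file name without a path (on the assumption that being in the same directory is enough):

Default_Path = os.path.dirname(sys.path[0])

def excelIssueKey(self):
    os.chdir(Default_Path) # Now change the directory
    dst_excel_file = 'Copy.xlsm'
    filename = 'xxx.xlsm'
    shutil.copyfile(filename,dst_excel_file)
    ....

def extraIssueKey(self):
    fname = []
    fname = glob.glob(Default_Path + '\\' + 'issue.txt')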
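Another common way to anchor paths is to resolve them against the location of the running program instead of the working directory. This is a sketch and not what the script above does: both function names are hypothetical, and the `sys.frozen` check relies on the attribute that freezing tools such as py2exe and PyInstaller set on the sys module.

```python
import os
import sys

def app_dir():
    """Directory of the running exe (when frozen) or of the script."""
    if getattr(sys, 'frozen', False):
        # Packaged exe: sys.executable is the exe itself.
        return os.path.dirname(os.path.abspath(sys.executable))
    # Plain script: sys.argv[0] is the script that was started.
    return os.path.dirname(os.path.abspath(sys.argv[0]))

def in_app_dir(name):
    """Absolute path of a file that sits next to the program."""
    return os.path.join(app_dir(), name)

print(in_app_dir('issue.txt'))
```

Printing the resolved path once from the packaged exe is a quick way to confirm it points where expected.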

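Summary:

1. Always define paths explicitly.

2. If external files are needed, be prepared for the file not to exist (and decide whether to create it when it is missing).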

3. user

Special thanks:

doge

Matt

0 0