Web Scraping with Python, Part 8 (translator: 哈雷)

Source: Internet | Editor: 程序博客网 | Time: 2024/04/27 19:51

Chapter 6: Reading Documents
1. Reading txt files. This one is very simple.

from urllib.request import urlopen

textPage = urlopen("http://www.pythonscraping.com/pages/warandpeace/chapter1.txt")
print(textPage.read())
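Note that read() on the object returned by urlopen gives bytes, so the print above shows a b'...' literal rather than plain text. A minimal sketch of decoding the result, assuming the page is UTF-8 encoded; the helper name fetch_text is hypothetical and not part of the original:

```python
from urllib.request import urlopen


def fetch_text(url, encoding="utf-8"):
    """Download url and return its body as str.

    The default encoding is an assumption; pass another one if the
    server uses a different charset.
    """
    with urlopen(url) as response:
        raw = response.read()  # bytes, not str
    return raw.decode(encoding)


# Usage (performs a network request):
# print(fetch_text("http://www.pythonscraping.com/pages/warandpeace/chapter1.txt")[:200])
```

Keeping the download inside a function makes it easy to swap in a different encoding when a page is not UTF-8.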

If the file is local, open() and read() are enough to read its entire content. write() writes data, and readline() is another commonly used function.
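The local-file calls mentioned above can be sketched as follows. The file name notes.txt and the temporary directory are placeholders chosen for illustration:

```python
import os
import tempfile

# Work in a throwaway directory so the sketch does not touch real files.
workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "notes.txt")  # hypothetical file name

# write() stores text; the with-block closes the file automatically.
with open(path, "w", encoding="utf-8") as f:
    f.write("first line\nsecond line\n")

# read() returns the whole file as one string.
with open(path, encoding="utf-8") as f:
    whole = f.read()

# readline() returns one line per call, including its trailing newline.
with open(path, encoding="utf-8") as f:
    first = f.readline()
    second = f.readline()

print(repr(whole))
print(repr(first), repr(second))
```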
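2. Reading CSV files. I have rarely run into these, so I will skip them for now and come back to them when the need arises.
3. Reading PDF documents. This requires an extra package, pdfminer (available at https://pypi.python.org/pypi/pdfminer3k). Download and unpack it, change into the unpacked directory, and run python3 setup.py install. Note that data read this way comes out as plain strings only: images are not shown and tables lose their formatting.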

from urllib.request import urlopen
from pdfminer.pdfinterp import PDFResourceManager, process_pdf
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from io import StringIO

def readPDF(pdfFile):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, laparams=laparams)
    process_pdf(rsrcmgr, device, pdfFile)
    device.close()
    content = retstr.getvalue()
    retstr.close()
    return content

pdfFile = urlopen("http://pythonscraping.com/pages/warandpeace/chapter1.pdf")
# pdfFile = open("chapter1.pdf", 'rb')  # use this line instead if the PDF is local
outputString = readPDF(pdfFile)
print(outputString)
pdfFile.close()

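Here is a second method, which I find simpler and easier to follow. The previous method uses the pdfminer package; this one uses PyPDF2. The code is as follows: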

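import PyPDF2

# These are the legacy PyPDF2 names; PyPDF2 3.0 replaced them with
# PdfReader, len(reader.pages), reader.pages[i] and extract_text().
with open('1.pdf', 'rb') as pdfFileObj:
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    print(pdfReader.numPages)  # number of pages
    pageObj = pdfReader.getPage(0)  # first page
    print(pageObj.extractText())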

This method needs little explanation; readers can work it out by analogy with the way the txt file was read.

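4. Reading Word documents. On Windows this is fairly simple with Python: download the relevant package and follow one of the many tutorials, which readers can find via Google. On Linux, the code below uses the docx module: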

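import docx

def getText(filename):
    doc = docx.Document(filename)  # create the doc object
    fullText = []
    print(len(doc.paragraphs))  # number of paragraphs in the document
    print(doc.paragraphs[0].text)  # text of the first paragraph
    # Text of the first run (one stretch of uniform formatting) in the first
    # paragraph; e.g. if the paragraph contains upright and italic text,
    # this prints the upright text
    print(doc.paragraphs[0].runs[0].text)
    for para in doc.paragraphs:  # collect the text of every paragraph
        fullText.append(para.text)
    return '\n'.join(fullText)

print(getText("1.docx"))  # docx.Document opens .docx files, not legacy .doc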
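As a library-free alternative to the docx module, the paragraph text can also be pulled out with the standard library alone, because a .docx file is a ZIP archive whose main text lives in word/document.xml. The sketch below is a simplified illustration, not a replacement for the library: the helper name docx_paragraphs is hypothetical, and it ignores headers, footnotes and any table structure.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraphs(path):
    """Return the text of each paragraph in a .docx file as a list of str."""
    with zipfile.ZipFile(path) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is split across one or more w:t elements.
        pieces = [node.text or "" for node in para.iter(W_NS + "t")]
        paragraphs.append("".join(pieces))
    return paragraphs


# Usage, with a hypothetical local file:
# print("\n".join(docx_paragraphs("1.docx")))
```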

5. Reading Excel spreadsheets on Linux. Reading and writing require two packages, xlrd and xlwt; readers should download and install them.

# -*- coding: utf-8 -*-
import os
import xlrd
import xlwt

rootdir = '/home/name1'
rootdir_2 = '/home/name2'
for filename in os.listdir(rootdir):
    filepath = rootdir + '/' + filename
    data = xlrd.open_workbook(filepath)
    book = xlwt.Workbook()  # create the output workbook
    sheet = book.add_sheet('sheet1', cell_overwrite_ok=True)  # add a sheet
    sheet.col(0).width = 1000  # set the width of the first column
    sheet.write(0, 0, 'Name')  # write Name into the first row, first column
    table = data.sheets()[0]  # first sheet
    nrows = table.nrows  # number of rows
    for i in range(nrows):
        try:
            old_name = table.row(i)[1].value  # value in the second cell of row i
            new_name = old_name.replace(',', ' ')  # cleanup with str.replace; more rules can go here
            sheet.write(i, 0, new_name)  # write it out (row 0 overwrites 'Name')
        except Exception as e:
            print(e)
    # save under the same name minus its last character (e.g. .xlsx becomes .xls)
    book.save(rootdir_2 + '/' + filename[:-1])
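The cleanup step in the loop above only replaces commas. If more rules are needed, a regular expression is one option. A minimal sketch, in which the helper clean_name and its rules (commas and semicolons become spaces, runs of whitespace collapse) are illustrative assumptions rather than the original author's rules:

```python
import re


def clean_name(value):
    """Normalize one cell value; non-string cells are converted with str()."""
    text = str(value)
    text = re.sub(r"[,;，；]", " ", text)  # ASCII and full-width separators
    text = re.sub(r"\s+", " ", text)      # collapse repeated whitespace
    return text.strip()


print(clean_name("Smith,  John"))
```

Converting with str() first also avoids the AttributeError that str.replace would raise on numeric cells, which the original loop catches and prints.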

0 0