Reading PDF documents with Python

Source: Internet. Published by: mysql root用户无权限. Editor: 程序博客网. Time: 2024/05/17 05:02

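The figure shows the overall workflow. It looks a little convoluted at first, so the steps below break it down (learned from a 慕课 course).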

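First, use open to open a local document, or urlopen to open a remote one, and obtain a document object. Working from a local copy is the usual choice, because a large document fetched over the network also puts a heavy load on the web server. The URL in the commented-out line below is an existing link.

# Get the document object
pdf0 = open('sampleFORtest.pdf','rb')
# pdf1 = urlopen('http://www.tencent.com/zh-cn/content/ir/an/2016/attachments/20160321.pdf')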
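
The choice between a local file and a remote one can be wrapped in a small helper. The following is only a sketch: the function name open_pdf_source is hypothetical, and it assumes that a remote PDF is small enough to hold in memory. It downloads the remote file once into an in-memory buffer, because a PDF parser needs to seek within the file and an HTTP response object does not support seeking.

```python
import io
from urllib.parse import urlparse
from urllib.request import urlopen


def open_pdf_source(location):
    """Return a seekable binary file object for a local path or an http(s) URL.

    Hypothetical helper for illustration; not part of pdfminer.
    """
    scheme = urlparse(location).scheme
    if scheme in ("http", "https"):
        # Download once and keep the bytes in memory, so the server is
        # contacted a single time and the result can be seeked freely.
        with urlopen(location) as response:
            return io.BytesIO(response.read())
    # Anything else is treated as a local file path.
    return open(location, "rb")
```

The returned object can be passed to PDFParser in place of pdf0, and the caller remains responsible for closing it.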

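Next, create a document parser and a PDF document object, and link the two to each other.

# Create a parser bound to the opened file
parser = PDFParser(pdf0)
# Create a PDF document object
doc = PDFDocument()
# Link the two
parser.set_document(doc)
doc.set_parser(parser)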

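Initialize the PDF document object. If the document is encrypted, pass its password as the argument; an empty string is used here because the sample is not encrypted.

# Initialize the document
doc.initialize('')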

First create a PDF resource manager and a layout-parameters object.

# Create the PDF resource manager
resources = PDFResourceManager()
# Create the layout-analysis parameters
laparam = LAParams()

Then create an aggregator, passing it the resource manager and the layout parameters.

# Create an aggregator from the resource manager and the layout parameters
device = PDFPageAggregator(resources,laparams=laparam)

Finally, create a page interpreter, passing it the resource manager and the aggregator.

# Create a page interpreter
interpreter = PDFPageInterpreter(resources,device)

With this in place, the page interpreter can decode the content of each PDF page into objects that Python can work with.

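Finally, call the document object's get_pages() method to obtain the pages of the PDF, and let the page interpreter process them one by one. After each page is processed, call the aggregator's get_result() method to obtain that page's layout, then call get_text() on the objects in the layout to get the page's text.

for page in doc.get_pages():
    # Process the page with the page interpreter
    interpreter.process_page(page)
    # Fetch the page's layout from the aggregator
    layout = device.get_result()
    for out in layout:
        if hasattr(out, 'get_text'):  # not every object in the layout is text
            pprint(out.get_text())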

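Note that a PDF document does not contain only text; it may also hold images and other objects. To avoid errors, check that an object has a get_text() method before calling it.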
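
The hasattr check can be tried in isolation, without a PDF at hand. The sketch below is an illustration only: collect_text, FakeTextBox and FakeImage are hypothetical names, and the two classes merely stand in for the text and non-text objects that a real layout would contain.

```python
def collect_text(layout):
    """Return the text of every object in `layout` that offers get_text()."""
    chunks = []
    for obj in layout:
        # Images and other non-text objects have no get_text() method.
        if hasattr(obj, "get_text"):
            chunks.append(obj.get_text())
    return chunks


# Stand-in classes, used only so that the sketch runs without a PDF.
class FakeTextBox:
    def __init__(self, text):
        self._text = text

    def get_text(self):
        return self._text


class FakeImage:
    pass


page = [FakeTextBox("Abstract\n"), FakeImage(), FakeTextBox("good results.\n")]
print(collect_text(page))  # the FakeImage entry is skipped
```

Collecting the strings in a list instead of printing them directly makes it easier to post-process or save the text later.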

Complete code

# encoding:utf-8
'''
@author:
@time:
'''
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams
from pdfminer.pdfparser import PDFParser, PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfdevice import PDFDevice
from pprint import pprint
from urllib.request import urlopen
# Get the document object
pdf0 = open('sampleFORtest.pdf','rb')
# pdf1 = urlopen('http://www.tencent.com/zh-cn/content/ir/an/2016/attachments/20160321.pdf')
# Create a parser bound to the opened file
parser = PDFParser(pdf0)
# Create a PDF document object
doc = PDFDocument()
# Link the two
parser.set_document(doc)
doc.set_parser(parser)
# Initialize the document (empty password)
doc.initialize('')
# Create the PDF resource manager
resources = PDFResourceManager()
# Create the layout-analysis parameters
laparam = LAParams()
# Create an aggregator from the resource manager and the layout parameters
device = PDFPageAggregator(resources,laparams=laparam)
# Create a page interpreter
interpreter = PDFPageInterpreter(resources,device)
# Iterate over the pages of the document
for page in doc.get_pages():
    # Process the page with the page interpreter
    interpreter.process_page(page)
    # Fetch the page's layout from the aggregator
    layout = device.get_result()
    for out in layout:
        if hasattr(out, 'get_text'):  # not every object in the layout is text
            pprint(out.get_text())
# Close the file once all pages have been read
pdf0.close()
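
The listing above imports PDFDocument from pdfminer.pdfparser, which matches the older pdfminer API. The maintained pdfminer.six fork arranges these classes differently, so the listing may not run there unchanged. As a rough equivalent, and assuming pdfminer.six is installed, the same per-block text can be read with its extract_pages helper. This sketch swaps in a different API from the one the article uses, and the function name iter_text_blocks is hypothetical.

```python
def iter_text_blocks(pdf_path):
    """Yield the text of each text container, page by page.

    Sketch only: assumes the pdfminer.six package, not the older API
    used in the article's listing.
    """
    # Imported lazily, so defining the function does not require pdfminer.six.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    for page_layout in extract_pages(pdf_path):
        for element in page_layout:
            # Plays the same role as the hasattr(out, 'get_text') check.
            if isinstance(element, LTTextContainer):
                yield element.get_text()


# Usage, with the file name from the article:
#     for text in iter_text_blocks('sampleFORtest.pdf'):
#         print(repr(text))
```

Here extract_pages sets up the parser, resource manager, aggregator and interpreter internally, which is why none of them appear in the sketch.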

The sample file is an officially provided one.

Output:

'Preemptive Information Extraction using Unrestricted Relation Discovery\n'
'Yusuke Shinyama\n'
'Satoshi Sekine\n'
'New York University\n715, Broadway, 7th Floor\nNew York, NY, 10003\n'
'{yusuke,sekine}@cs.nyu.edu\n'
'Abstract\n'
('We are trying to extend the boundary of\n'
'Information Extraction (IE) systems. Ex-\n'
'isting IE systems require a lot of time and\n'
'human effort to tune for a new scenario.\n'
'Preemptive Information Extraction is an\n'
'attempt to automatically create all feasible\n'
'IE systems in advance without human in-\n'
'tervention. We propose a technique called\n'
'Unrestricted Relation Discovery that dis-\n'
'covers all possible relations from texts and\n'
'presents them as tables. We present a pre-\n'
'liminary system that obtains reasonably\n'
'good results.\n')
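
Each block in the output keeps the line breaks and end-of-line hyphens of the PDF layout (for example 'Ex-\n' followed by 'isting'). If running text is wanted, a small clean-up step can be applied to every block. The following is a minimal sketch using only the re module. The name tidy_block is hypothetical, and the rule is deliberately naive: it also joins genuine hyphenated compounds that happen to break at the end of a line.

```python
import re


def tidy_block(text):
    """Join words hyphenated across line breaks, then unwrap the lines."""
    # "Ex-\nisting" -> "Existing": drop a hyphen that ends a line.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Replace the remaining line breaks with single spaces.
    return re.sub(r"\s*\n\s*", " ", text).strip()


sample = (
    "Information Extraction (IE) systems. Ex-\n"
    "isting IE systems require a lot of time and\n"
    "human effort to tune for a new scenario.\n"
)
print(tidy_block(sample))
```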
