Parsing TXT files with Python and PDF files with pdfminer

Source: Internet | Editor: 程序博客网 | Time: 2024/05/17 22:06

I. Reading a TXT document

urlopen()

A simple example: read the contents of https://en.wikipedia.org/robots.txt

from urllib.request import urlopen

html = urlopen("https://en.wikipedia.org/robots.txt")
print(html.read().decode("utf-8"))
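The example above hard-codes UTF-8. As a minimal sketch, the same read can take the charset from the response headers when the server declares one, and fall back to UTF-8 otherwise. The helper name read_text and its default parameter are illustrative assumptions, not part of the original example.

```python
from urllib.request import urlopen


def read_text(url, default_charset="utf-8"):
    """Fetch a URL and return its body as text (hypothetical helper)."""
    # The with-block closes the connection once the body has been read.
    with urlopen(url) as response:
        # Use the charset declared in Content-Type, if any; otherwise fall back.
        charset = response.headers.get_content_charset() or default_charset
        return response.read().decode(charset)
```

Calling read_text("https://en.wikipedia.org/robots.txt") and printing the result should behave like the example above, with the connection closed afterwards.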

II. Reading a PDF document

pdfminer3k

1. Install the pdfminer module

pip install pdfminer3k

2. Write the code in an IDE

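A simple example: read the local PDF file simple1.pdf. In this example, simple1.pdf has been placed in the project folder beforehand. To read a PDF document from the network instead, comment out the original fp=open.. line, change it to fp=urlopen..., and add from urllib.request import urlopen. The listing below shows the network variant, with the local open call left in as a comment.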

from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams
from pdfminer.pdfparser import PDFParser, PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfdevice import PDFDevice
from urllib.request import urlopen

# Open a PDF file.
# "rb" opens the file for reading in binary mode.
# fp = open("simple1.pdf", "rb")
fp = urlopen("http://www.tencent.com/zh-cn/articles/8003251479983154.pdf")

# Create a PDF parser object associated with the file object.
parser = PDFParser(fp)
# Create a PDF document object that stores the document structure.
doc = PDFDocument()
# Connect the parser and document objects.
parser.set_document(doc)
doc.set_parser(parser)
# Supply the password for initialization.
# (If no password is set, give an empty string.)
doc.initialize("")

# Create the PDF resource manager.
resource = PDFResourceManager()
# Layout analysis parameters.
laparam = LAParams()
# Create an aggregator device.
device = PDFPageAggregator(resource, laparams=laparam)
# Create the PDF page interpreter.
interpreter = PDFPageInterpreter(resource, device)

# Read the content page by page through the document object.
for page in doc.get_pages():
    # Process the page with the page interpreter.
    interpreter.process_page(page)
    # Fetch the page layout from the aggregator.
    layout = device.get_result()
    for out in layout:
        # Only layout objects that carry text expose get_text().
        if hasattr(out, "get_text"):
            print(out.get_text())
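The switch between a local file and a network URL, and the inner loop that prints text, can each be wrapped in a small helper. The following is only a sketch under stated assumptions: open_pdf_source and collect_text are hypothetical names that do not appear in the original, and the download is buffered into memory on the assumption that the PDF parser needs a seekable file object, which an HTTP response is not.

```python
import io
from urllib.request import urlopen


def open_pdf_source(location):
    """Return a seekable binary file object for a path or an http(s) URL.

    Hypothetical helper: network content is read fully into memory first,
    because an HTTP response object does not support seek().
    """
    if location.startswith(("http://", "https://")):
        with urlopen(location) as response:
            return io.BytesIO(response.read())
    # Local file: open for reading in binary mode, as in the example above.
    return open(location, "rb")


def collect_text(layout):
    """Join the text of every layout object that exposes get_text().

    Mirrors the hasattr check in the loop above, but returns the text
    instead of printing it (hypothetical helper).
    """
    return "".join(out.get_text() for out in layout if hasattr(out, "get_text"))
```

With these helpers, fp = open_pdf_source("simple1.pdf") would replace the two alternative fp lines, and print(collect_text(layout)) would replace the inner for loop. Buffering the whole file in memory is a simple choice for small documents; very large PDFs may be better saved to disk first.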
