Web scraping: saving Liao Xuefeng's online tutorial as a local PDF file

Source: Internet | Editor: 程序博客网 | Time: 2024/06/16 09:21

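I came across a post from [Python之禅] by chance. After following the official account I found plenty of interesting material there, so I decided to follow the author's walkthrough and try it out myself.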

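1. Preparation:

1.1 Analyzing the site structure:

Site: Liao Xuefeng's Python tutorial
Analysis:
The left side of the page is the tutorial's table of contents, and each URL in it maps to one article shown on the right. On the right, the title sits at the top, the article body is in the middle, and the user comments are at the bottom. The body is what we care about: the data to crawl is the body of every page. The comment section is of no use here and can be ignored.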

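1.2 Preparing the tools:

requests and BeautifulSoup are the two workhorses of web scraping: requests handles the network requests, and BeautifulSoup manipulates the HTML data. Converting HTML files to PDF also needs library support. wkhtmltopdf is a very good tool for this and converts HTML to PDF on multiple platforms; pdfkit is the Python wrapper around wkhtmltopdf.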

1.2.1 Install pip, if it was not selected when Python was installed

See: "Installing and using pip, Python's package manager"

python get-pip.py

1.2.2 Install requests

pip install requests

1.2.3 Install beautifulsoup

pip install beautifulsoup

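This command fails with an error (the screenshot from the original post is not available).
The printed output shows that the beautifulsoup package supports Python 2 only, not Python 3.
Solution: install beautifulsoup4 instead:

pip install beautifulsoup4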

1.2.4 Install pdfkit

pip install pdfkit

1.2.5 Download and install wkhtmltopdf

Download: wkhtmltopdf
After installation, add the install directory to the system PATH.

2. Crawler design:

The goal of the program is to save the body HTML behind every URL to local files, then use pdfkit to convert those files into a single PDF file.

Save the body HTML of one URL to a local file
Find all the URLs and do the same for each of them
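The two steps above can be sketched as a single loop. This is only a minimal illustration: `save_pages` and `fake_fetch` are hypothetical names that do not appear in the original source, and `fake_fetch` stands in for a real download so that the sketch runs without network access.

```python
import os
import tempfile


def save_pages(urls, fetch_body, out_dir):
    """Save the body HTML of every URL as <index>.html; return the paths in order."""
    paths = []
    for index, url in enumerate(urls):
        html = fetch_body(url)  # expected to return the body HTML as bytes
        path = os.path.join(out_dir, "%d.html" % index)
        with open(path, "wb") as f:
            f.write(html)
        paths.append(path)
    return paths


def fake_fetch(url):
    # Hypothetical stand-in for "download the page and extract its body".
    return ("<p>body of %s</p>" % url).encode("utf-8")


out_dir = tempfile.mkdtemp()
saved = save_pages(["/wiki/a", "/wiki/b"], fake_fetch, out_dir)
print([os.path.basename(p) for p in saved])
```

Numbering the files by index keeps them in table-of-contents order, which is the order they should later be handed to pdfkit.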

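In Chrome, press F12 to open the developer tools and locate the tag that holds the article body: <div class="x-wiki-content">. This div contains the body content of the page. Once requests has loaded the whole page, BeautifulSoup can operate on the HTML DOM elements to extract the body.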
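As a rough sketch of that extraction step, the snippet below pulls the `x-wiki-content` div out of an HTML string with BeautifulSoup. The `sample_html` markup and the `extract_body` helper are made up for illustration; in real use the input would presumably be `requests.get(url).content`.

```python
from bs4 import BeautifulSoup

# Made-up page that mimics the structure described above.
sample_html = """
<html><body>
  <h4>Sample title</h4>
  <div class="x-wiki-content"><p>Article body</p></div>
  <div class="comments">ignored</div>
</body></html>
"""


def extract_body(html):
    """Return the markup of the body div, or None if the page has none."""
    soup = BeautifulSoup(html, "html.parser")
    body = soup.find("div", class_="x-wiki-content")
    return None if body is None else str(body)


body_html = extract_body(sample_html)
print(body_html)
```

Returning None for a page without the div lets the caller decide whether to skip the page or stop.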

3. Python implementation:

I downloaded the author's source code, tidied it up, and added comments.

3.1 Problems encountered and solutions:

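Problem 1: file not found
The wkhtmltopdf value in Configuration is assigned incorrectly.
Fix:
In configuration.py, the line self.wkhtmltopdf = wkhtmltopdf is changed to (the replacement value is missing from the original post)
Problem 2: unknown protocol error

The crawl still produces its result, but this problem remains unresolved.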
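One possible way to handle the "file not found" problem without editing pdfkit's own configuration.py is to locate the executable yourself and pass it to pdfkit explicitly. The sketch below is an assumption about how that could look, not the fix used in the original post; `locate_wkhtmltopdf` and `build_pdfkit_config` are hypothetical helper names, while `pdfkit.configuration(wkhtmltopdf=...)` is part of pdfkit.

```python
import os
import shutil


def locate_wkhtmltopdf(extra_candidates=()):
    """Return the path of the wkhtmltopdf executable, or None if it is not found."""
    found = shutil.which("wkhtmltopdf")  # searches the directories on PATH
    if found:
        return found
    for candidate in extra_candidates:  # e.g. a known install location
        if os.path.isfile(candidate):
            return candidate
    return None


def build_pdfkit_config(executable_path):
    """Build a pdfkit configuration that points at an explicit executable."""
    import pdfkit  # imported lazily so the lookup above works without pdfkit
    return pdfkit.configuration(wkhtmltopdf=executable_path)


path = locate_wkhtmltopdf()
if path is None:
    print("wkhtmltopdf not found; add its install directory to PATH")
else:
    print("using", path)
```

The resulting object can then be passed along with the other arguments, for example `pdfkit.from_file(htmls, "out.pdf", options=options, configuration=build_pdfkit_config(path))`.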

3.2 Source code:

# coding=utf-8
from __future__ import unicode_literals

import logging
import os
import re
import time

try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2

import pdfkit
import requests
from bs4 import BeautifulSoup

html_template = """
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
</head>
<body>
{content}
</body>
</html>
"""


class Crawler(object):
    """Crawler base class; every crawler should inherit from it."""

    name = None

    def __init__(self, name, start_url):
        """
        :param name: name of the PDF file to save, without the extension
        :param start_url: entry URL for the crawler
        """
        self.name = name
        self.start_url = start_url
        self.domain = '{uri.scheme}://{uri.netloc}'.format(uri=urlparse(self.start_url))

    def crawl(self, url):
        """Download one page and return the response object."""
        print(url)
        response = requests.get(url)
        return response

    def parse_menu(self, response):
        """
        Parse the table of contents and return every article URL; implemented by subclasses.
        :param response: response object returned by the crawler
        :return: an iterable of URLs (a list, generator, or tuple all work)
        """
        raise NotImplementedError

    def parse_body(self, response):
        """
        Parse the article body; implemented by subclasses.
        :param response: response object returned by the crawler
        :return: the processed HTML text
        """
        raise NotImplementedError

    def run(self):
        start = time.time()
        # options controls the PDF format
        options = {
            'page-size': 'Letter',
            'margin-top': '0.75in',
            'margin-right': '0.75in',
            'margin-bottom': '0.75in',
            'margin-left': '0.75in',
            'encoding': "UTF-8",
            'custom-header': [
                ('Accept-Encoding', 'gzip')
            ],
            'cookie': [
                ('cookie-name1', 'cookie-value1'),
                ('cookie-name2', 'cookie-value2'),
            ],
            'outline-depth': 10,
        }
        # Parse the HTML behind each menu entry and save it as an HTML file
        htmls = []
        for index, url in enumerate(self.parse_menu(self.crawl(self.start_url))):
            html = self.parse_body(self.crawl(url))
            if html is None:  # parse_body logs the error and returns None on failure
                continue
            f_name = ".".join([str(index), "html"])
            with open(f_name, 'wb') as f:
                f.write(html)
            htmls.append(f_name)

        pdfkit.from_file(htmls, self.name + ".pdf", options=options)
        for html in htmls:
            os.remove(html)
        total_time = time.time() - start
        print(u"Total time: %f seconds" % total_time)


class LiaoXueFengPythonCrawler(Crawler):  # the parentheses denote inheritance
    """Subclass: crawls Liao Xuefeng's Python 3 tutorial."""

    def parse_menu(self, response):
        """
        Parse the table of contents and yield every article URL.
        :param response: response object returned by the crawler
        :return: a generator of URLs
        """
        soup = BeautifulSoup(response.content, "html.parser")
        menu_tag = soup.find_all(class_="uk-nav uk-nav-side")[1]
        for li in menu_tag.find_all("li"):
            url = li.a.get("href")
            if not url.startswith("http"):
                url = "".join([self.domain, url])  # complete it to a full URL
            yield url

    def parse_body(self, response):
        """
        Parse the article body.
        :param response: response object returned by the crawler
        :return: the processed HTML as UTF-8 bytes, or None if parsing fails
        """
        try:
            soup = BeautifulSoup(response.content, 'html.parser')
            body = soup.find_all(class_="x-wiki-content")[0]

            # Add the title, centered
            title = soup.find('h4').get_text()
            center_tag = soup.new_tag("center")
            title_tag = soup.new_tag('h1')
            title_tag.string = title
            center_tag.insert(1, title_tag)
            body.insert(1, center_tag)

            html = str(body)
            # Turn relative src paths of img tags in the body into absolute paths
            pattern = "(<img .*?src=\")(.*?)(\")"

            def func(m):
                # group(2) is the src value; the original tested group(3), the closing quote
                if not m.group(2).startswith("http"):
                    rtn = "".join([m.group(1), self.domain, m.group(2), m.group(3)])
                    return rtn
                else:
                    return "".join([m.group(1), m.group(2), m.group(3)])

            html = re.compile(pattern).sub(func, html)
            html = html_template.format(content=html)
            html = html.encode("utf-8")
            return html
        except Exception as e:
            logging.error("Parse error", exc_info=True)
            return None


if __name__ == '__main__':
    start_url = "http://www.liaoxuefeng.com/wiki/0014316089557264a6b348958f449949df42a6d3a2e542c000"
    crawler = LiaoXueFengPythonCrawler("廖雪峰blogs", start_url)
    crawler.run()
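The image-path step in parse_body prefixes the site domain to relative src values. As an alternative sketch (not the author's method), `urllib.parse.urljoin` from the standard library can do the joining and leaves URLs that are already absolute untouched. The helper name `absolutize_img_src`, the sample markup, and the `cdn.example.com` host are all made up for illustration, and the snippet targets Python 3 only.

```python
import re
from urllib.parse import urljoin

# Three groups: everything up to the opening quote, the src value, the closing quote.
IMG_SRC = re.compile(r'(<img\s[^>]*?src=")([^"]*)(")')


def absolutize_img_src(html, base_url):
    """Rewrite every <img src="..."> in html to an absolute URL relative to base_url."""
    def repl(m):
        return m.group(1) + urljoin(base_url, m.group(2)) + m.group(3)
    return IMG_SRC.sub(repl, html)


html = '<p><img alt="a" src="/files/a.png"><img src="http://cdn.example.com/b.png"></p>'
print(absolutize_img_src(html, "http://www.liaoxuefeng.com/wiki/page"))
```

Unlike plain string concatenation with the domain, urljoin also resolves paths that do not start with a slash against the directory of the page URL.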