Crawling a hexo-built blog with a scraper and generating md documents

Source: Internet · Published by: 淘宝cos店三 · Editor: 程序博客网 · Time: 2024/05/16 08:49

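Motivation

All of my old blog posts were lost when I reinstalled my computer. Copying and pasting them back one by one would be far too tedious, and I had not written any python for a long time, so I decided to write a scraper that crawls the posts and generates md documents.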

Analysis

Technologies and libraries

This uses python + BeautifulSoup4 (loading and parsing pages) + urllib (sending requests) + codecs (writing files).

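Home page

Let's look at the home page and where a single post sits in it.
[Screenshot]

Next, let's see how all the posts are laid out.
[Screenshot]
This is about as simple as a list structure gets.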

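Pagination

The posts will not fit on a single page, so handling pagination comes down to working out the page URLs. Once we know the pattern, we can crawl all the way to the end in one run.
Look at the URL of the second page:
[Screenshot]
From that we can guess that page 6 should be http://wintersmilesb101.online/page/6

Sure enough, it is.
So can the first page be written as http://wintersmilesb101.online/page/1 ?
[Screenshot]

It cannot, which means the first page needs special handling: its URL is simply http://wintersmilesb101.online
[Screenshot]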
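The URL rule worked out above can be captured in a small helper. This is only a sketch: the function name `page_url` and the constant `BASE_URL` are made up for illustration and are not part of the original script.

```python
# Hypothetical helper illustrating the pagination rule described above.
BASE_URL = "http://wintersmilesb101.online"


def page_url(page: int, base_url: str = BASE_URL) -> str:
    """Return the list-page URL for a 1-based page number."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    # The first page is the bare site URL; /page/1 does not work on this blog.
    if page == 1:
        return base_url
    return f"{base_url}/page/{page}"


print(page_url(1))
print(page_url(6))
```

Keeping the special case in one place means the crawl loop can treat every page the same way.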

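Getting the page count

The DOM structure that holds the page count looks like this:
[Screenshot]

As you can see, the page index elements all share the same class. So we either use BeautifulSoup to select the preceding element (which, as the screenshot shows, is unique within this structure), or use BeautifulSoup's select method to get the list of elements with class = page-number; the last one in that list is the element holding the page count.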
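The second approach (select every class = page-number element and take the last one) might look roughly like the sketch below. The pager HTML is invented for the example and only approximates what a hexo theme emits; `get_page_count` is a hypothetical name, and `html.parser` is used here only so the snippet has no extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical pager markup, written for this example only.
SAMPLE_PAGER = """
<nav class="pagination">
  <span class="page-number current">1</span>
  <a class="page-number" href="/page/2/">2</a>
  <span class="space">&hellip;</span>
  <a class="page-number" href="/page/6/">6</a>
  <a class="extend next" href="/page/2/">&gt;</a>
</nav>
"""


def get_page_count(html: str) -> int:
    """Return the number of list pages, taken from the last .page-number element."""
    soup = BeautifulSoup(html, "html.parser")
    numbers = soup.select(".page-number")
    # A blog with a single page has no pager at all.
    if not numbers:
        return 1
    return int(numbers[-1].get_text(strip=True))


page_count = get_page_count(SAMPLE_PAGER)
print(page_count)
```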

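Post information

Which pieces of post information do we need?
Since the blog is built with hexo, we have to follow its rules. In general we need a front matter block like this:

---
title: Python3.7 爬虫（二）使用 Urllib2 与 BeautifulSoup 抓取解析网页
date: 2017-04-08
updated: 2017-04-09
categories:
- 爬虫
- Python 爬虫
tags:
- Python3
- 爬虫
- Urllib2
- BeautifulSoup4
---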
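As a rough illustration, a front matter block of this shape can be assembled with a small function. `build_front_matter` is a hypothetical helper, and the sketch assumes the second date line is meant to be hexo's `updated` field (the post's last-modified date).

```python
def build_front_matter(title, created, updated, categories, tags):
    """Build a hexo front matter block as a single string."""
    lines = [
        "---",
        f"title: {title}",
        f"date: {created}",
        # Assumption: the second date is hexo's `updated` field.
        f"updated: {updated}",
        "categories:",
    ]
    lines.extend(f"- {category}" for category in categories)
    lines.append("tags:")
    lines.extend(f"- {tag}" for tag in tags)
    lines.append("---")
    return "\n".join(lines) + "\n"


front_matter = build_front_matter(
    "demo-post", "2017-04-08", "2017-04-09", ["爬虫"], ["Python3", "BeautifulSoup4"]
)
print(front_matter)
```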

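All of these front matter fields except the tags can already be read from the post list page, as shown here:
[Screenshot]

Of course, the article page has all of this information as well. The link to the article page can be taken from the title position, and joining it with the site's base URL gives the final link. Some URLs contain Chinese characters, though, so we need urllib.request.quote(link) to percent-encode them correctly. That call also encodes : as %3A, so after the conversion we have to restore %3A to :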
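An alternative to encoding and then restoring %3A is to tell `quote` which characters to leave alone through its `safe` parameter. The sketch below uses `urllib.parse.quote` (the same function that `urllib.request.quote` refers to); the article path is invented for illustration and `encode_article_url` is a hypothetical name.

```python
from urllib.parse import quote


def encode_article_url(link: str) -> str:
    """Percent-encode non-ASCII characters while leaving ':' and '/' untouched."""
    return quote(link, safe=":/")


# Hypothetical article link containing Chinese characters.
link = "http://wintersmilesb101.online/2017/04/08/爬虫/"
encoded = encode_article_url(link)
print(encoded)

# For this link the result matches the encode-then-restore approach from the text.
same_as_replace = quote(link).replace("%3A", ":")
```

Passing `safe` avoids accidentally rewriting a %3A that was already part of the path.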

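Converting the body

For the body, we get the element with class = post-body and iterate over its child elements through the children attribute. Note that children whose type is bs4.element.NavigableString are not real elements and must be skipped. Each remaining element is then converted to its markdown equivalent according to the mapping between html and markdown. BeautifulSoup has quite a few pitfalls here; the comments in the code explain them, so I will not repeat them.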
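The core of that conversion, reduced to headings, paragraphs and unordered lists, could be sketched as follows. The sample HTML and the names `HEADING_PREFIX` and `body_to_markdown` are assumptions made for this example; the real pages contain more element types (code figures, ordered lists, images).

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Mapping from html heading tags to markdown prefixes.
HEADING_PREFIX = {"h1": "# ", "h2": "## ", "h3": "### ", "h4": "#### "}

# Hypothetical post body, written for this example only.
SAMPLE_BODY = """<div class="post-body">
<h2>Analysis</h2>
<p>Plain paragraph.</p>
<ul><li>first</li><li>second</li></ul>
</div>"""


def body_to_markdown(html: str) -> str:
    """Convert the direct children of div.post-body to markdown lines."""
    soup = BeautifulSoup(html, "html.parser")
    body = soup.find("div", class_="post-body")
    out = []
    for child in body.children:
        # Whitespace between tags shows up as NavigableString nodes; skip them.
        if isinstance(child, NavigableString):
            continue
        text = child.get_text().strip()
        if child.name in HEADING_PREFIX:
            out.append(HEADING_PREFIX[child.name] + text)
        elif child.name == "ul":
            out.extend("- " + li.get_text().strip() for li in child.find_all("li"))
        elif child.name == "p":
            out.append(text)
    return "\n".join(out) + "\n"


markdown = body_to_markdown(SAMPLE_BODY)
print(markdown)
```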

Implementation

import codecs
import os
import urllib.request

import bs4
from bs4 import BeautifulSoup

filePath = r"H:/GIT/Blog/WinterSmileSB101/source/_posts/old/"
url = "http://wintersmilesb101.online"
urlBase = "http://wintersmilesb101.online/page/"
headers = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}
# Markdown prefixes for heading tags
headings = {'h1': '# ', 'h2': '## ', 'h3': '### ', 'h4': '#### '}


def fetchSoup(pageUrl):
    # Request a page and parse it with the lxml parser
    req = urllib.request.Request(pageUrl, headers=headers)
    response = urllib.request.urlopen(req)
    content = response.read().decode('utf-8')
    return BeautifulSoup(content, "lxml")


print('Requesting list page: ' + url)
soup = fetchSoup(url)

# Get the number of pages: the link right after the last span.space points to the last page
spans = soup.select('span.space')
pageHref = spans[-1].next_sibling['href']
# The page number is the third '/'-separated segment of the href
pageNum = int(pageHref.split('/')[2])
print(pageNum)

index = 1
while index <= pageNum:
    # Pages after the first need their own URL; the first page is already loaded
    if index > 1:
        url = urlBase + str(index)
        print('Requesting list page: ' + url)
        soup = fetchSoup(url)
    # Get the list of articles
    articles = soup.find_all('article')
    # Process each article
    for article in articles:
        # Creation time (the title attribute values are the site's own Chinese labels)
        createTime = article.find('time', title="创建于").text.strip()
        # Update time
        updateTime = article.find('time', title="更新于").text.strip()
        # Categories
        categories = article.find_all('a', attrs={'itemprop': "url", 'rel': "index"})
        categoryNames = [category.text.strip() for category in categories]
        # Link to the article page; the second-to-last path segment is used as the title
        link = article.link['href']
        articleTitle = link.split('/')[-2]
        # Percent-encode the Chinese characters in the URL
        urlMain = urllib.request.quote(link)
        # quote() also turns ':' into '%3A', so restore it
        urlMain = urlMain.replace('%3A', ':')
        print('Requesting article: ' + urlMain)
        mainSoup = fetchSoup(urlMain)
        body = mainSoup.find('div', itemprop="articleBody")
        blockquote = body.blockquote
        blockquoteText = ''
        if blockquote is not None:
            blockquoteText = blockquote.p.text
            externalUrl = None
            mineUrl = blockquote.p.a['href']
            if blockquote.p.find('a', rel="external"):
                externalUrl = blockquote.p.find('a', rel="external")['href']
            # Replace the links in the quote with md syntax
            if externalUrl:
                blockquoteText = blockquoteText.replace("原文地址", "[原文地址](" + externalUrl + ")")
            blockquoteText = blockquoteText.replace(mineUrl, "[" + mineUrl + "](" + mineUrl + ")")
        # Tags
        tags = mainSoup.find_all('a', rel='tag')
        # Write the md file; create the directory for this page if it does not exist
        outDir = filePath + str(index) + '/'
        if not os.path.exists(outDir):
            os.makedirs(outDir)
        # Open the file with an explicit encoding
        file = codecs.open(outDir + articleTitle + '.md', "w", encoding='utf8')
        # Front matter
        file.write('---\n')
        file.write("title: " + articleTitle + '\n')
        file.write("date: " + createTime + '\n')
        file.write("updated: " + updateTime + '\n')
        file.write("categories:\n")
        for category in categoryNames:
            file.write('- ' + category + '\n')
        file.write("tags:\n")
        for tag in tags:
            file.write('- ' + tag.text.replace('# ', '') + '\n')
        file.write('---\n')
        # Quote block
        if blockquote is not None:
            file.write('> ' + blockquoteText + '\n')
        # Walk the body's children and write them out. Note that next_sibling returns the
        # node immediately after, which here is a '\n' string, so reaching the next tag
        # that way takes two steps.
        for nextTag in body.children:
            # Skip bare text nodes
            if isinstance(nextTag, bs4.element.NavigableString):
                continue
            tagContent = nextTag.text.strip()
            # Headings
            if nextTag.name in headings:
                file.write(headings[nextTag.name] + tagContent + '\n')
                continue
            # Code block: the tag is a figure or contains one
            if len(nextTag.select('figure')) > 0 or nextTag.name == 'figure':
                # A non-empty select means this element wraps the figure
                if len(nextTag.select('figure')) > 0:
                    nextTag = nextTag.select('figure')[0]
                # The last class name on the figure is the code language
                codeType = nextTag['class'][-1] + '\n'
                codeStart = '``` '
                codeEnd = '```\n'
                codeLine = ''
                lineNumber = nextTag.table.tr.find('td', attrs={'class': 'gutter'}).text
                codeCell = nextTag.table.tr.find('td', attrs={'class': 'code'})
                # Whatever is left after removing line numbers and code is the caption
                tagContent = tagContent.replace(lineNumber, '').replace(codeCell.text, '')
                for line in codeCell.find_all('div'):
                    codeLine += line.text.strip() + '\n'
                file.write(tagContent + '\n')
                file.write(codeStart + codeType + codeLine + '\n' + codeEnd)
                continue
            # Unordered list
            if nextTag.name == 'ul':
                for li in nextTag.find_all('li'):
                    file.write('- ' + li.text.strip() + '\n')
                continue
            # Ordered list
            if nextTag.name == 'ol':
                olIndex = 1
                for li in nextTag.find_all('li'):
                    file.write(str(olIndex) + '. ' + li.text.strip() + '\n')
                    olIndex += 1
                continue
            if nextTag.name == 'p':
                # A paragraph with no text is treated as an image
                if tagContent == '':
                    img = nextTag.find('img')
                    if img is not None:
                        file.write("![image](" + img['src'] + ")\n")
                    continue
                # Replace link text with md link syntax (skip links with empty text)
                for a in nextTag.find_all('a'):
                    if a.text:
                        tagContent = tagContent.replace(a.text, "[" + a['href'] + "](" + a['href'] + ")")
                file.write(tagContent + '\n')
                continue
        file.close()
    index = index + 1

Results

Posts from the first page

[Screenshot]

Posts from the second page

[Screenshot]

The first post. I think the result is pretty good.
[Screenshot]

Code

All of the code for this post has been pushed to git.

If you spot any problems, I would appreciate your feedback!
