Web Scraping with Python: Downloading CSDN Blog Images with Python

Source: Internet  Published by: 火猫解说收入知乎  Editor: 程序博客网  Time: 2024/05/16 12:38

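I. Introduction

I have been studying web scraping with Python lately, reading two books side by side:

Web Scraping with Python
精通 Scrapy 网络爬虫 (Mastering Scrapy Web Crawlers)

Today's inspiration comes from this snippet on page 115 of Web Scraping with Python: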

    from urllib.request import urlretrieve
    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    html = urlopen("http://www.pythonscraping.com")
    bsObj = BeautifulSoup(html)
    imageLocation = bsObj.find("a", {"id": "logo"}).find("img")["src"]
    urlretrieve(imageLocation, "logo.jpg")

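This snippet is a concise, clear demonstration of urllib.request.urlretrieve, which downloads the resource at a remote URL to a local file.

Next, I use this core function together with the BeautifulSoup library to download the images embedded in my own CSDN blog posts.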

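II. Design

Before writing a crawler, we should be clear about the goals:

Crawl all of the user's blog posts, get each post's title, and create a folder named after the title to hold that post's images
Put the downloaded images into the folder named after the post's title

To meet this requirement, we need to answer the following key questions:

1. How do we crawl all of a user's blog posts?

From my testing and debugging in Chrome and scrapy shell, these selectors work: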

    # Article list and article links: http://blog.csdn.net/u012814856
    articles = bsObj.findAll('div', {'class': 'article_item'})
    # `article` is one element of `articles`
    link = article.h1.a.attrs['href']

    # Article title and images: http://blog.csdn.net/u012814856/article/details/78370952
    title = bsObj.h1.get_text()
    images = bsObj.find('div', {'class': 'article_content'}).findAll('img')

With this information we can write the logic that extracts the data we need.

2. How do we move from one page to the next?

From my testing in Chrome and scrapy shell:

    # Path to the next page (find returns None when there is no "下一页" link)
    next_url = bsObj.find('a', text='下一页').attrs['href']

With this we can follow the article list from page to page.
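The href of the next-page link may be relative, so it has to be resolved against the current page URL before the next request. Below is a minimal sketch of that loop. The names `iter_page_urls` and `find_next_href` are hypothetical, chosen for illustration, and the fake `pages` table stands in for real CSDN responses so the example runs offline.

```python
from urllib.parse import urljoin


def iter_page_urls(start_url, find_next_href, max_pages=100):
    """Yield each list-page URL, following next-page links until none is left.

    find_next_href(url) is a hypothetical callback that returns the href of
    the next-page link found on `url`, or None on the last page.
    """
    url = start_url
    for _ in range(max_pages):  # safety cap in case the links form a cycle
        yield url
        href = find_next_href(url)
        if href is None:
            break
        # Resolve a relative href against the URL of the current page.
        url = urljoin(url, href)


# Fake site used only for this demonstration (not real CSDN data).
pages = {
    'http://blog.example.com/user': '/user/article/list/2',
    'http://blog.example.com/user/article/list/2': '/user/article/list/3',
    'http://blog.example.com/user/article/list/3': None,
}

visited = list(iter_page_urls('http://blog.example.com/user', pages.get))
print(visited)
```

In a real crawler the callback would open the page, parse it, and return the href of the 下一页 link if one exists.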

3. How do we download the images?

For this we only need the urllib.request.urlretrieve function mentioned in the introduction.
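As a rough sketch of how urlretrieve can be applied to a list of image URLs, the helper below saves them as 1.jpg, 2.jpg, and so on. The name `download_all` is hypothetical, and the demonstration uses a local file:// URL in place of a remote image so it runs without network access.

```python
import os
import tempfile
from pathlib import Path
from urllib.request import urlretrieve


def download_all(urls, directory):
    """Save each URL into `directory` as 1.jpg, 2.jpg, ... and return the paths."""
    saved = []
    for count, url in enumerate(urls, start=1):
        target = os.path.join(directory, '%d.jpg' % count)
        urlretrieve(url, target)  # copies the resource at `url` to `target`
        saved.append(target)
    return saved


# Offline demonstration: a local file reached through a file:// URL
# stands in for a remote image.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / 'source.bin'
    source.write_bytes(b'fake image bytes')
    out_dir = os.path.join(tmp, 'out')
    os.makedirs(out_dir)
    saved = download_all([source.as_uri()], out_dir)
    copied = Path(saved[0]).read_bytes()

print(os.path.basename(saved[0]), copied)
```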

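4. How do we deal with characters that are not allowed in folder names?

This is an interesting problem. Take the title of one of my blog posts:

Scrapy 小技巧：选择器（Selectors）怎么写

This title cannot be used as a folder name because it contains a colon. These are the characters Windows does not allow in folder names:
[Figure: deny chars, i.e. \ / : * ? " < > |]
We only need to replace every such character in each title with a space: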

    # Replace deny char, used to name a directory.
    def replace_deny_char(title):
        deny_char = ['\\', '/', ':', '*', '?', '\"', '<', '>', '|', '：']
        for char in deny_char:
            title = title.replace(char, ' ')
        print('Convert title is: %s' % title)
        return title

The function simply walks the list of denied characters in a for loop and replaces each one.
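The same replacement can also be written as a single regular-expression substitution. This is only an alternative sketch, not the code used in this post: `sanitize_title` and `DENY_PATTERN` are hypothetical names, and the final strip() is an extra step I am assuming is wanted, so that titles do not end in spaces.

```python
import re

# Characters Windows rejects in file and folder names, plus the full-width
# colon that the deny list in this post also replaces.
DENY_PATTERN = re.compile(r'[\\/:*?"<>|：]')


def sanitize_title(title):
    """Replace every denied character with a space and trim both ends."""
    return DENY_PATTERN.sub(' ', title).strip()


print(sanitize_title('Scrapy 小技巧：选择器（Selectors）怎么写'))
# prints: Scrapy 小技巧 选择器（Selectors）怎么写
print(sanitize_title('a/b:c?'))
# prints: a b c
```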

5. How do we create a folder?

This is basic Python, and I looked it up on the spot.
This short, to-the-point blog post demonstrates all three functions involved:
Python创建目录文件夹
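For reference, here is a minimal sketch of creating a nested folder with the standard library. It uses os.makedirs with exist_ok=True, which is one possible approach rather than the one from the linked post. `ensure_directory` and the folder names are hypothetical, and a temporary directory is used so nothing is left behind.

```python
import os
import tempfile


def ensure_directory(path):
    """Create `path` and any missing parents; do nothing if it already exists."""
    os.makedirs(path, exist_ok=True)
    return path


with tempfile.TemporaryDirectory() as base:
    target = os.path.join(base, 'CSDN Blog', 'Some article title')
    ensure_directory(target)
    ensure_directory(target)  # the second call is a no-op, not an error
    created = os.path.isdir(target)

print(created)
```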

Having answered these key questions, we have everything we need to implement the requirement. On to the implementation :)

III. Implementation

Here is my code:

    # -*- coding:utf-8 -*-
    from urllib.request import urlopen
    from urllib.request import urlretrieve
    from urllib.parse import urljoin
    from bs4 import BeautifulSoup
    import os


    # Parse one page's articles.
    def parse_page(bsObj, url):
        articles = bsObj.findAll('div', {'class': 'article_item'})
        links = []
        for article in articles:
            links.append(article.h1.a.attrs['href'])
            print(links[-1])
            parse_article(urljoin(url, links[-1]))


    # Parse one article.
    def parse_article(url):
        # 1. Open article site.
        html = urlopen(url)
        bsObj = BeautifulSoup(html, 'html.parser')

        # 2. Parse title and create directory with title.
        title = bsObj.h1.get_text()
        print('Article title is: %s' % title)
        directory = 'CSDN Blog/%s' % replace_deny_char(title)
        if os.path.exists(directory) is False:
            os.makedirs(directory)

        # 3. Parse and download images.
        images = bsObj.find('div', {'class': 'article_content'}).findAll('img')
        count = 0
        for img in images:
            count += 1
            imgUrl = urljoin(url, img.attrs['src'])
            print('Download image url: %s' % imgUrl)
            urlretrieve(imgUrl, '%s//%d.jpg' % (directory, count))


    # Replace deny char, used to name a directory.
    def replace_deny_char(title):
        deny_char = ['\\', '/', ':', '*', '?', '\"', '<', '>', '|', '：']
        for char in deny_char:
            title = title.replace(char, ' ')
        print('Convert title is: %s' % title)
        return title


    print('Please input your CSDN name:')
    name = input()
    url = 'http://blog.csdn.net/%s' % name
    while True:
        # 1. Open new page.
        html = urlopen(url)
        bsObj = BeautifulSoup(html, 'html.parser')
        print('Enter new page: %s' % url)

        # 2. Crawl every article.
        parse_page(bsObj, url)

        # 3. Move to next page.
        next_url = bsObj.find('a', text='下一页')
        if next_url is not None:
            url = urljoin(url, next_url.attrs['href'])
        else:
            break
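The program saves every image with a .jpg extension, whatever its real format. One possible refinement, sketched here under the assumption that the image URL path ends in a meaningful extension, is to take the extension from the URL and fall back to .jpg otherwise. `image_filename` and the example URLs are hypothetical.

```python
import os
from urllib.parse import urlparse


def image_filename(img_url, count, default_ext='.jpg'):
    """Build "<count><ext>", taking the extension from the URL path when present."""
    path = urlparse(img_url).path  # the path only, without query string or fragment
    ext = os.path.splitext(path)[1].lower()
    return '%d%s' % (count, ext or default_ext)


print(image_filename('http://img.example.com/2017/pic.PNG?x=1', 1))
# prints: 1.png
print(image_filename('http://img.example.com/2017/12345', 2))
# prints: 2.jpg
```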