Implementing a Simple Web Crawler in Python

Source: Internet | Editor: 程序博客网 | Time: 2024/06/06 08:35

Goals:

Learn to develop a lightweight crawler
Only static web pages that require no login are considered

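Contents:

Implementation options (for storing the set of URLs to crawl and the set of URLs already crawled):

In memory: Python sets: URLs to crawl: set(); URLs already crawled: set()
Relational database: MySQL: urls(url, is_crawled)
Cache database: Redis: URLs to crawl: set; URLs already crawled: set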
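
The in-memory option can be sketched as a small class built on two Python sets. The class name UrlManager and its method names are hypothetical, chosen only for illustration; this is a minimal sketch, not a definitive implementation.

```python
class UrlManager:
    """Hypothetical in-memory URL manager backed by two sets."""

    def __init__(self):
        self.new_urls = set()  # URLs waiting to be crawled
        self.old_urls = set()  # URLs already crawled

    def add_new_url(self, url):
        """Queue a URL unless it is empty or already known."""
        if not url:
            return
        if url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def add_new_urls(self, urls):
        """Queue every URL from an iterable (None is treated as empty)."""
        for url in urls or []:
            self.add_new_url(url)

    def has_new_url(self):
        """Report whether any URL is still waiting to be crawled."""
        return len(self.new_urls) > 0

    def get_new_url(self):
        """Move one pending URL to the crawled set and return it."""
        url = self.new_urls.pop()
        self.old_urls.add(url)
        return url
```

Keeping both sets means a URL is never queued twice, whether it is still pending or has already been crawled. The MySQL and Redis options store the same two collections outside the process.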

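Web page downloader: a tool that downloads the web page at a given URL from the internet to the local machine
Types: urllib2: the basic module from Python's standard library (in Python 3 its functionality lives in urllib.request)
Requests: a third-party package with more features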

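urllib2 page download, method 1: the simplest approach (written for Python 3, where urllib2 became urllib.request)

    import urllib.request

    # Request the URL directly
    response = urllib.request.urlopen('http://www.baidu.com')
    # Get the status code; 200 means the request succeeded
    print(response.getcode())
    # Read the content (returned as bytes)
    cont = response.read()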

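urllib2 page download, method 2: add data and an HTTP header (written for Python 3, where urllib2 became urllib.request)

    import urllib.parse
    import urllib.request

    url = 'http://www.baidu.com'  # example URL, same as in method 1
    # Encode the form data; Python 3 has no Request.add_data,
    # so the data is passed to the constructor (this makes the request a POST)
    data = urllib.parse.urlencode({'a': '1'}).encode('utf-8')
    # Create the Request object
    request = urllib.request.Request(url, data=data)
    # Add an HTTP header
    request.add_header('User-Agent', 'Mozilla/5.0')
    # Send the request and get the result
    response = urllib.request.urlopen(request)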

urllib2 page download, method 3: add handlers for special scenarios
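
The source gives no code for this method. One common special scenario is a site that needs cookies, which the Python 3 standard library covers with HTTPCookieProcessor. Other handlers such as ProxyHandler or HTTPRedirectHandler can be passed to build_opener in the same way. The following is a minimal sketch assuming the cookie case; the helper names build_cookie_opener and download_with_cookies are hypothetical.

```python
import http.cookiejar
import urllib.request


def build_cookie_opener():
    """Return an opener that stores and resends cookies, plus its cookie jar."""
    cookie_jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(cookie_jar)
    )
    return opener, cookie_jar


def download_with_cookies(url, opener):
    """Fetch a URL through the given opener and return the raw bytes."""
    response = opener.open(url)
    if response.getcode() != 200:
        return None
    return response.read()
```

A typical call would be opener, jar = build_cookie_opener() followed by download_with_cookies('http://www.baidu.com', opener). Calling urllib.request.install_opener(opener) would make plain urlopen() calls use the same handlers.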

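Web page parser
Web page parser: a tool that extracts valuable data from a web page
Which web page parsers does Python have?
Regular expressions: fuzzy matching
html.parser | BeautifulSoup | lxml: structured parsing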
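
The difference between the two families can be seen on a tiny page using only the standard library. This is an illustrative sketch: the sample markup in page and the class name LinkCollector are made up for the example.

```python
import re
from html.parser import HTMLParser

page = '<p><a href="/view/123.htm">Python</a> <a href="/about.htm">About</a></p>'

# Fuzzy matching: a regular expression scans the raw text for a pattern
regex_links = re.findall(r'href="(/view/\d+\.htm)"', page)


class LinkCollector(HTMLParser):
    """Structured parsing: react to each <a> tag and read its href attribute."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')
            if href:
                self.links.append(href)


collector = LinkCollector()
collector.feed(page)

print(regex_links)      # ['/view/123.htm']
print(collector.links)  # ['/view/123.htm', '/about.htm']
```

The regular expression knows nothing about tags and only finds text that matches the pattern, while the structured parser sees every a element and leaves the filtering to the caller.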

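    import re
    from bs4 import BeautifulSoup

    # Create a BeautifulSoup object from the HTML document string
    soup = BeautifulSoup(
        html_doc,              # HTML document string
        'html.parser',         # HTML parser
        from_encoding='utf8'   # encoding of the HTML document (only used for bytes input)
    )

    # Method: find_all(name, attrs, string)
    # Find all nodes with tag a
    soup.find_all('a')
    # Find all a nodes whose link has the form /view/123.htm
    soup.find_all('a', href='/view/123.htm')
    soup.find_all('a', href=re.compile(r'/view/\d+\.htm'))
    # Find all div nodes with class abc and text Python
    soup.find_all('div', class_='abc', string='Python')

Accessing node information

    # Given the node: <a href='1.html'>python</a>
    # Get the tag name of the node that was found
    node.name
    # Get the href attribute of the node that was found
    node['href']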

Example test

    import re
    from bs4 import BeautifulSoup

    html_doc = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    """

    # html_doc is already a str, so from_encoding is not needed
    soup = BeautifulSoup(html_doc, 'html.parser')

    print('Get all links')
    links = soup.find_all('a')
    for link in links:
        print(link.name, link['href'], link.get_text())

    print('Get the lacie link')
    link_node = soup.find('a', href='http://example.com/lacie')
    print(link_node.name, link_node['href'], link_node.get_text())

    print('Regex match')
    link_node = soup.find('a', href=re.compile(r"ill"))
    print(link_node.name, link_node['href'], link_node.get_text())

    print('Get the text of the p paragraph')
    p_node = soup.find('p', class_="title")
    print(p_node.name, p_node.get_text())

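Example crawler

Determine the target -> analyze the target (URL format, data format, page encoding) -> write the code -> run the crawler
Target: the title and summary of the pages linked from the Baidu Baike python entry
Entry page: http://baike.baidu.com/view/21078.htm
URL format:
- Entry page URL: /view/125370.htm
Data format:
- Title:

     <dd class="lemmaWgt-lemmaTitle-title"><h1>***</h1></dd>

- Summary:

    <div class="lemma-summary">***</div>

Page encoding: UTF-8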
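
The analysis above maps fairly directly onto a parsing step. The following is a minimal sketch under the assumption that pages still use exactly the markup listed above; the live site may have changed since. The function name parse_entry_page and the keys of the returned dictionary are hypothetical.

```python
import re
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def parse_entry_page(page_url, html_cont):
    """Return (new_urls, data) extracted from one entry page."""
    soup = BeautifulSoup(html_cont, 'html.parser')

    # Collect links that match the entry URL format /view/<digits>.htm
    new_urls = set()
    for link in soup.find_all('a', href=re.compile(r'/view/\d+\.htm')):
        # Turn the relative link into an absolute URL
        new_urls.add(urljoin(page_url, link['href']))

    # Title: <dd class="lemmaWgt-lemmaTitle-title"><h1>***</h1></dd>
    title_dd = soup.find('dd', class_='lemmaWgt-lemmaTitle-title')
    title_node = title_dd.find('h1') if title_dd else None
    # Summary: <div class="lemma-summary">***</div>
    summary_node = soup.find('div', class_='lemma-summary')

    data = {
        'url': page_url,
        'title': title_node.get_text() if title_node else None,
        'summary': summary_node.get_text() if summary_node else None,
    }
    return new_urls, data
```

The function returns None for a title or summary that is missing instead of raising, so a crawl loop can skip pages that do not follow the expected layout. When html_cont is the raw bytes from the downloader, from_encoding='utf-8' can be passed to BeautifulSoup to match the page encoding noted above.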
