The Making of a Crawler

Source: Internet | Published: seo工作一年感想 | Editor: 程序博客网 | Time: 2024/05/08 07:56

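In my spare time I had a sudden urge to try my hand at web crawlers, so I went looking for tutorials. A Baidu search turned up a few fairly typical crawlers, but the material was still rather disorganized, and for someone who had never touched crawlers before it was tough going. I ended up following a Python crawler video course, and I am recording some of my notes here so we can learn together.

The environment below is Python 2.7.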

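Crawler architecture:

URL manager: tracks the URLs waiting to be crawled and the URLs already crawled, preventing duplicate fetches and infinite loops.

Page downloader: downloads the whole page and stores it as a string, e.g. urllib2, requests.

Page parser: extracts the data you want and captures new URLs, which are handed back to the URL manager for further crawling. It also filters the data so that only the valuable parts are kept for processing.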
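The way the three components cooperate can be sketched roughly as follows. This is only a minimal illustration, not the course's code: the names `UrlManager`, `crawl`, `download`, `parse`, and `max_pages` are hypothetical, and the "site" is an in-memory dict so the example runs without network access.

```python
class UrlManager(object):
    """Tracks URLs waiting to be crawled and URLs already crawled."""

    def __init__(self):
        self.new_urls = set()   # waiting to be crawled
        self.old_urls = set()   # already crawled

    def add_new_url(self, url):
        # Ignore empty values and anything that has been seen before.
        if url and url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def add_new_urls(self, urls):
        for url in urls or []:
            self.add_new_url(url)

    def has_new_url(self):
        return len(self.new_urls) > 0

    def get_new_url(self):
        # Move one URL from the waiting set to the crawled set.
        url = self.new_urls.pop()
        self.old_urls.add(url)
        return url


def crawl(root_url, download, parse, max_pages=10):
    """Main loop: download(url) returns page content,
    parse(url, content) returns (new_urls, data)."""
    manager = UrlManager()
    manager.add_new_url(root_url)
    results = []
    while manager.has_new_url() and len(results) < max_pages:
        url = manager.get_new_url()
        new_urls, data = parse(url, download(url))
        manager.add_new_urls(new_urls)   # parser feeds the URL manager
        results.append(data)
    return results


# Fake site: each "page" is just the list of pages it links to (note the cycles).
pages = {'a': ['b', 'c'], 'b': ['a', 'c'], 'c': ['a']}
results = crawl('a', lambda url: pages[url], lambda url, links: (links, url))
```

Even though the fake pages link back to each other, every page is visited only once, because a URL that is already in either set is never queued again.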

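Data storage:

A Python set prevents duplicate entries.

For long-term persistence, store the data in a relational database.

When performance matters, store it in a cache database.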
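As a rough sketch of the first two options, the snippet below uses a set to skip duplicates and SQLite as a stand-in for a relational database. The notes do not name a specific database, so SQLite, the `pages` table, its columns, and the `save_pages` helper are all assumptions chosen for illustration. A cache database is not shown.

```python
import sqlite3


def save_pages(conn, pages):
    """Store (url, title) rows, skipping URLs that were already stored."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, title TEXT)"
    )
    # Load known URLs into a set so duplicate checks are cheap.
    seen = set(row[0] for row in conn.execute("SELECT url FROM pages"))
    inserted = 0
    for url, title in pages:
        if url in seen:
            continue
        seen.add(url)
        conn.execute("INSERT INTO pages (url, title) VALUES (?, ?)", (url, title))
        inserted += 1
    conn.commit()
    return inserted


# In-memory database for the demo; pass a file path instead to persist the data.
conn = sqlite3.connect(":memory:")
count = save_pages(conn, [
    ("http://example.com/elsie", "Elsie"),
    ("http://example.com/lacie", "Lacie"),
    ("http://example.com/elsie", "Elsie again"),   # duplicate URL, skipped
])
```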

Three ways to use the page downloader:

urllib2.urlopen(url)

The simplest:

import urllib2
response = urllib2.urlopen(url)  # open the URL
response.getcode()  # get the status code

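With parameters:

import urllib
import urllib2
request = urllib2.Request(url)
request.add_data(urllib.urlencode({'name': 'value'}))  # data; add_data takes a single encoded string
request.add_header('User-Agent', '...')  # mimic a browser's request header
response = urllib2.urlopen(request)  # pass the Request object, not the bare url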

With cookies, proxies, HTTPS, redirects, etc.:

HTTPCookieProcessor, ProxyHandler, HTTPSHandler, HTTPRedirectHandler

import urllib2, cookielib
cj = cookielib.CookieJar()  # create the cookie jar
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
urllib2.install_opener(opener)  # install the opener
response = urllib2.urlopen(url)  # access the URL with cookies
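The "with parameters" variant can be wrapped in a small helper, sketched below. The helper name `build_request`, the default User-Agent string, and the URL `http://example.com/login` are hypothetical. The notes target Python 2.7, so the imports fall back to the Python 3 module names only so that the sketch also runs there. The request is built but deliberately not sent.

```python
try:                                   # Python 2.7, as in the notes
    from urllib import urlencode
    from urllib2 import Request
except ImportError:                    # Python 3 equivalents
    from urllib.parse import urlencode
    from urllib.request import Request


def build_request(url, params=None, user_agent='Mozilla/5.0'):
    """Build (but do not send) a request with optional form data
    and a browser-like User-Agent header."""
    # Form data must be URL-encoded into a single string before it is attached.
    data = urlencode(params).encode('utf-8') if params else None
    request = Request(url, data=data)
    request.add_header('User-Agent', user_agent)
    return request


request = build_request('http://example.com/login', {'name': 'value'})
```

Attaching data turns the request into a POST; without data it stays a GET. Passing the returned object to `urlopen` would actually send it.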

Parsers: regex, html.parser, BeautifulSoup, lxml
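For comparison, here is a minimal sketch of what capturing new URLs looks like with only the standard-library html.parser. The class name `LinkParser` and the sample HTML are made up for illustration, and the import fallback is there only so the code runs on both Python 2.7 and Python 3.

```python
try:                                   # Python 3
    from html.parser import HTMLParser
    from urllib.parse import urljoin
except ImportError:                    # Python 2.7
    from HTMLParser import HTMLParser
    from urlparse import urljoin


class LinkParser(HTMLParser):
    """Collects absolute URLs from the href attribute of every <a> tag."""

    def __init__(self, base_url):
        HTMLParser.__init__(self)
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        for name, value in attrs:
            if name == 'href' and value:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


parser = LinkParser('http://example.com/')
parser.feed('<p><a href="/elsie">Elsie</a> and <a href="lacie">Lacie</a></p>')
links = parser.links
```

This works, but every kind of lookup needs its own handler code, which is part of why a higher-level library is more convenient for searching nodes.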

Generally bs4 (Beautiful Soup 4) is used.

1. Create the bs object

soup = BeautifulSoup(
    html_doc,               # the document string
    'html.parser',          # the HTML parser
    from_encoding='utf-8'   # encoding of the HTML document
)

2. Search for nodes
soup.find_all(tag name, attributes, string)  # regular expressions can be used directly in the search

Example:

# coding:utf8
from bs4 import BeautifulSoup
import re

html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'html.parser', from_encoding='utf-8')
links = soup.find_all('a')
# for link in links:
#     print link.name, link['href'], link.get_text()

# print 'only lacie'
# link = soup.find('a', href='http://example.com/lacie')
# print link

# print 'regex start....'
# reg = soup.find('a', href=re.compile(r'ill'))
# print reg.get_text()

# print 'p'
# p_node = soup.find('p', class_=re.compile(r"s"))
# print p_node.get_text()

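Uncomment and run each statement yourself to see its result.

The content above is taken from the slides of an instructor at 慕课网 (imooc).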

0 0