Learning to Write Web Crawlers in Python (2)

Crawling a site via its sitemap
The Robots protocol (also called the crawler protocol or robots protocol), formally the Robots Exclusion Protocol, is how a website tells search engines which pages may be crawled and which may not.
CSDN's robots.txt: http://www.csdn.net/robots.txt
The robots.txt file contains the address of CSDN's sitemap.
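
As a quick sketch (not from the original post), Python 2's standard-library robotparser module can check whether a given user agent is allowed to fetch a page; the Sitemap: lines have to be pulled out of robots.txt by hand, since robotparser does not expose them. The 'wswp' user agent and the CSDN URLs below are the ones used elsewhere in this post.

# A minimal sketch (Python 2), assuming the site serves a standard robots.txt.
import urllib2
import robotparser

robots_url = 'http://www.csdn.net/robots.txt'

# Check crawl permission for a given user agent.
rp = robotparser.RobotFileParser()
rp.set_url(robots_url)
rp.read()
print rp.can_fetch('wswp', 'http://www.csdn.net/article/sitemap.txt')

# Scan robots.txt directly for Sitemap: lines, which robotparser does not expose.
for line in urllib2.urlopen(robots_url):
    if line.lower().startswith('sitemap:'):
        print line.split(':', 1)[1].strip()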

import urllib2
import re

def download(url, user_agent='wswp', num_retries=2):
    # Download a URL, retrying up to num_retries times on 5xx server errors.
    print 'Downloading:', url
    headers = {'User-agent': user_agent}
    request = urllib2.Request(url, headers=headers)
    try:
        html = urllib2.urlopen(request).read()
    except urllib2.URLError as e:
        print 'Download error:', e.reason
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                return download(url, num_retries - 1)
    return html

def crawl_sitemap(url):
    sitemap = download(url)
    # Extract every URL between <loc> tags; links is a list of URLs.
    links = re.findall('<loc>(.*?)</loc>', sitemap)
    for link in links:
        # html = download(link)
        print link

crawl_sitemap('http://example.webscraping.com/sitemap.xml')

def dlsm(url):
    # CSDN's sitemap is a plain-text file, so just split it into lines.
    sitemap = download(url)
    links = sitemap.split('\n')
    for ls in links:
        print ls

# This is CSDN's sitemap.
dlsm('http://www.csdn.net/article/sitemap.txt')

Crawling by ID
http://example.webscraping.com/view/Afghanistan-1
http://example.webscraping.com/view/Albania-3
http://example.webscraping.com/view/American-Samoa-5
These URLs differ only at the end: a country name followed by a numeric ID. The server sometimes ignores the string before the ID and matches the database record by the ID alone, so http://example.webscraping.com/view/5 and http://example.webscraping.com/view/American-Samoa-5 return the same page.
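A rough way to check this claim (a sketch, not part of the original post) is to download both URL forms and compare the responses; the fetch helper below is hypothetical, and the comparison may not be byte-for-byte identical if the page embeds any dynamic content.

# A minimal sketch (Python 2) to confirm the ID alone is enough.
import urllib2

def fetch(url, user_agent='wswp'):
    # 'wswp' matches the user agent used elsewhere in this post.
    request = urllib2.Request(url, headers={'User-agent': user_agent})
    return urllib2.urlopen(request).read()

full = fetch('http://example.webscraping.com/view/American-Samoa-5')
short = fetch('http://example.webscraping.com/view/5')
# Expect True if the server really resolves both to the same record.
print full == short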

# coding=utf-8
import urllib2
import itertools

def download(url, user_agent='wswp', num_retries=2):
    # Download a URL, retrying up to num_retries times on 5xx server errors.
    print 'Downloading:', url
    headers = {'User-agent': user_agent}
    request = urllib2.Request(url, headers=headers)
    try:
        html = urllib2.urlopen(request).read()
    except urllib2.URLError as e:
        print 'Download error:', e.reason
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                return download(url, num_retries - 1)
    return html

def crawlId():
    # Walk the IDs upward from 1 and stop at the first failed download.
    for page in itertools.count(1):
        url = 'http://example.webscraping.com/view/%d' % page
        html = download(url)
        if html is None:
            break
        else:
            print html

crawlId()

# An enhanced version of crawlId.
def crawlWitdId():
    maxErrors = 5
    numErrors = 0
    for page in itertools.count(1):
        url = 'http://example.webscraping.com/view/%d' % page
        html = download(url)
        print html
        # Database IDs are not necessarily consecutive, so track failures
        # and only stop after 5 consecutive download errors.
        if html is None:
            numErrors += 1
            if numErrors == maxErrors:
                break
        else:
            numErrors = 0

crawlWitdId()