Learning to Write a Web Crawler in Python (Part 1)

To find out who owns a website, you can use the WHOIS protocol to look up the registrant of its domain. The python-whois module handles this lookup; installation steps and a short usage sketch follow.
On Linux, install the module with: pip install python-whois
On Windows, install it by hand:
1. Download the module and unpack it
2. Open cmd and change into the unpacked module's directory
3. Run the commands: python setup.py build
   python setup.py install
4. Reopen your Python IDE and run import whois; if no error is raised, the install succeeded
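
For example, a minimal lookup sketch (assuming the package installs under the name whois and exposes a whois.whois() function, as the python-whois project documents):

import whois

# Query the WHOIS record for a domain; the result prints as a
# JSON-like summary of registrar, registrant and date fields
w = whois.whois('cnblogs.com')
print w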

# The simplest possible crawler
import urllib2

def download(url):
    return urllib2.urlopen(url).read()

print download('http://www.cnblogs.com/guoyongheng')

# A more robust version that catches exceptions
import urllib2

def download(url):
    print 'Downloading:', url
    try:
        html = urllib2.urlopen(url).read()
    except urllib2.URLError as e:
        print 'Download error:', e.reason
        html = None
    return html

print download('http://www.cnblogs.com/guoyongheng')

# If a 5xx error occurs, retry the download
import urllib2

def download(url, num_retries=2):
    print 'Downloading:', url
    try:
        html = urllib2.urlopen(url).read()
    except urllib2.URLError as e:
        print 'Download error:', e.reason
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # Retry only on HTTP 5xx (server) errors
                return download(url, num_retries - 1)
    return html

print download('http://httpstat.us/500')
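
A note on the hasattr(e, 'code') check: for HTTP error statuses, urllib2.urlopen raises urllib2.HTTPError, a subclass of URLError that carries the numeric status in e.code, while network-level failures raise a plain URLError with no code attribute. The check therefore retries only genuine server-side (5xx) responses; the test URL http://httpstat.us/500 always answers with status 500, so it exercises the retry path.
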
# To make downloads more reliable, set a default user agent, 'wswp'.
# Compared with the earlier code: with this user agent set, my CSDN blog
# could be crawled, whereas without it the page could not be fetched.
import urllib2

def download(url, user_agent='wswp', num_retries=2):
    print 'Downloading:', url
    headers = {'User-agent': user_agent}
    request = urllib2.Request(url, headers=headers)
    try:
        html = urllib2.urlopen(request).read()
    except urllib2.URLError as e:
        print 'Download error:', e.reason
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # Pass user_agent through so retries keep the same header
                return download(url, user_agent, num_retries - 1)
    return html

print download('http://blog.csdn.net/gyhguoge01234')
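
To confirm the header is actually being sent, one quick check (my own addition, not from the original post) is httpbin.org, which echoes back the User-Agent it received:

# With the default argument this should print {"user-agent": "wswp"}
print download('http://httpbin.org/user-agent')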