Writing Web Crawlers in Python, Part 1

Source: Internet | Published by: 淘宝黑搜二期 | Editor: 程序博客网 | Date: 2024/04/25 23:22

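Let's start with two packages I recently came across: builtwith and whois. Anaconda 4.1.1, which I am using, does not ship with either of them, so they have to be installed by hand: run pip install builtwith to install builtwith, and pip install python-whois to install whois.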

What do these two packages do, and what are they for?

builtwith: reports which technologies a website is built with. Example:

    import builtwith  # import the builtwith package
    builtwith.parse('http://example.webscraping.com')  # call builtwith.parse to see which technologies the site uses

The output is:

{u'javascript-frameworks': [u'jQuery', u'Modernizr', u'jQuery UI'], u'programming-languages': [u'Python'], u'web-frameworks': [u'Web2py', u'Twitter Bootstrap'], u'web-servers': [u'Nginx']}

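The result reads literally: each key is a technology category and each value lists what was detected. Anyone with some web background should find it easy to follow.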
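To work with such a result in code rather than by eye, a minimal sketch could flatten it into (category, technology) pairs. The dictionary below is copied from the output above instead of calling builtwith, the code is written for Python 3, and `list_technologies` is a hypothetical helper name, not part of the builtwith package.

```python
def list_technologies(report):
    """Flatten a builtwith-style result into (category, technology) pairs."""
    return [(category, tech)
            for category in sorted(report)
            for tech in report[category]]


# Data copied from the output shown above.
report = {
    "javascript-frameworks": ["jQuery", "Modernizr", "jQuery UI"],
    "programming-languages": ["Python"],
    "web-frameworks": ["Web2py", "Twitter Bootstrap"],
    "web-servers": ["Nginx"],
}

pairs = list_technologies(report)
for category, tech in pairs:
    print("%-22s %s" % (category, tech))
```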

whois: as the name suggests, looks up who a website's domain is registered to. Example:

    import whois
    print whois.whois('www.appspot.com')

The output is:

    ......  # the full output is long, so part of it is omitted and only the key fields are kept
      "country": "US",
      "whois_server": "whois.markmonitor.com",
      "state": "CA",
      "registrar": "MarkMonitor, Inc.",
      "referral_url": "http://www.markmonitor.com",
      "address": "2400 E. Bayshore Pkwy",
      "name_servers": [
        "NS1.GOOGLE.COM",
        "NS2.GOOGLE.COM",
        "NS3.GOOGLE.COM",
        "NS4.GOOGLE.COM",
        "ns2.google.com",
        "ns1.google.com",
        "ns3.google.com",
        "ns4.google.com"
      ],
      "org": "Google Inc.",
      "creation_date": [
        "2005-03-10 00:00:00",
        "2005-03-09 18:27:55"
      ],
      "emails": [
        "abusecomplaints@markmonitor.com",
        "dns-admin@google.com"
      ]
    }

The "org" field shows that this domain belongs to Google.
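Reading the ownership fields out of such a record can be sketched as follows. This is only an illustration under stated assumptions: it works on a plain dictionary abbreviated from the output above rather than calling whois, it is written for Python 3, and `summarize_whois` is a hypothetical helper name, not part of python-whois.

```python
def summarize_whois(record):
    """Pick the ownership-related fields out of a whois-style dict."""
    # The name servers appear in both upper and lower case in the output,
    # so lower-case them and keep each one only once.
    servers = sorted({ns.lower() for ns in record.get("name_servers", [])})
    return {
        "org": record.get("org"),
        "registrar": record.get("registrar"),
        "name_servers": servers,
    }


# Sample data abbreviated from the output shown above.
sample = {
    "org": "Google Inc.",
    "registrar": "MarkMonitor, Inc.",
    "name_servers": ["NS1.GOOGLE.COM", "NS2.GOOGLE.COM",
                     "ns2.google.com", "ns1.google.com"],
}

summary = summarize_whois(sample)
print(summary["org"])
print(summary["name_servers"])
```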
So why use these two packages?

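Before crawling a site, it helps to check roughly which technologies it uses and what kind of company you are dealing with. Large companies usually have strong anti-scraping measures, so a quick look beforehand is often worthwhile.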

Writing the first crawler function:

We start with urllib2 as a small introduction. Here is the code, with comments:

    import urllib2

    def download(url, user_agent='wswp', num_retries=2):
        # three parameters: the URL, the User-Agent string and the number of retries
        print 'Downloading:', url  # print the page we are about to fetch
        headers = {'User-agent': user_agent}  # set the User-Agent request header
        request = urllib2.Request(url, headers=headers)  # build a Request object, which can carry headers and other options
        try:
            html = urllib2.urlopen(request).read()  # on success, download the whole page
        except urllib2.URLError as e:  # catch the error and print its reason
            print 'Download error:', e.reason
            html = None
            if num_retries > 0:
                # 5xx status codes mean the server failed, often temporarily, so the request is worth retrying
                if hasattr(e, 'code') and 500 <= e.code < 600:
                    return download(url, user_agent, num_retries - 1)
        return html

    if __name__ == "__main__":
        download('http://httpstat.us/500')
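For readers on Python 3, where urllib2 was split into urllib.request and urllib.error, the same retry idea can be sketched as below. This is not the original code: the `fetch` parameter is a hypothetical hook added only so the retry logic can be exercised with a fake function instead of a real network call, and the `example.invalid` URL is a placeholder.

```python
import urllib.error
import urllib.request


def download(url, user_agent="wswp", num_retries=2, fetch=urllib.request.urlopen):
    """Return the page body as bytes, or None if the download fails."""
    request = urllib.request.Request(url, headers={"User-agent": user_agent})
    try:
        return fetch(request).read()
    except urllib.error.URLError as e:
        print("Download error:", e.reason)
        # Only 5xx responses are retried. HTTPError carries the status in
        # `code`; a plain URLError has no such attribute.
        if num_retries > 0 and 500 <= getattr(e, "code", 0) < 600:
            return download(url, user_agent, num_retries - 1, fetch)
        return None


calls = []


def always_500(request):
    """Fake fetcher: record each attempt, then fail with HTTP 500."""
    calls.append(request.full_url)
    raise urllib.error.HTTPError(
        request.full_url, 500, "Internal Server Error", None, None)


result = download("http://example.invalid/page", fetch=always_500)
print(result, len(calls))
```

With the default `num_retries=2`, the fake fetcher is called three times: one initial attempt plus two retries, after which `download` gives up and returns `None`.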

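The function above downloads an entire web page. While studying today I came across two interesting resources. The first explains which side is to blame for status codes beginning with 1, 2, 3, 4, and 5.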

Details: https://tools.ietf.org/html/rfc7231#section-6

The second is a site that returns whichever status code you ask for: http://httpstat.us/
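The five status code classes from RFC 7231, section 6 can be written down as a small lookup. This is a sketch in Python 3, and `status_class` and `should_retry` are hypothetical helper names. It shows why download() retries only on codes from 500 to 599: those are the ones that point at the server rather than at the request.

```python
def status_class(code):
    """Map an HTTP status code to its class as named in RFC 7231, section 6."""
    classes = {
        1: "Informational",
        2: "Successful",
        3: "Redirection",
        4: "Client Error",
        5: "Server Error",
    }
    return classes.get(code // 100, "Unknown")


def should_retry(code):
    """Retry only server errors, mirroring the 5xx check in download()."""
    return status_class(code) == "Server Error"


for code in (200, 301, 404, 500, 503):
    print(code, status_class(code), should_retry(code))
```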

Improving the crawler:

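Browsing http://example.webscraping.com/, we find that each country's detail page is served at an address of the form view/<country name>-<number>. If we drop the country name and keep only "-<number>", the page still loads normally, so iterating over all the countries becomes very simple: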

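    import itertools  # itertools makes open-ended iteration convenient

    max_errors = 5  # maximum number of consecutive download errors allowed
    num_errors = 0  # current number of consecutive download errors
    for page in itertools.count(1):  # the number of pages is unknown, so count upward from 1 indefinitely
        url = 'http://example.webscraping.com/view/-%d' % page
        html = download(url)
        if html is None:
            num_errors += 1  # record one more consecutive error
            if num_errors == max_errors:  # stop only after 5 downloads in a row have failed
                break
        else:
            num_errors = 0  # a successful download resets the counter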
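The consecutive-error counter is what lets the loop survive gaps in the ID sequence. The sketch below, in Python 3, wraps the same logic in a function and runs it against a fake downloader so the behavior can be checked offline. `crawl_ids` and `fake_download` are hypothetical names, and the set of "existing" IDs is made up for the demonstration.

```python
import itertools


def crawl_ids(download, url_template, max_errors=5):
    """Request url_template % 1, % 2, ... until max_errors downloads fail in a row.

    `download` is any function that returns the page body, or None on failure.
    Returns the list of URLs that were downloaded successfully.
    """
    found = []
    num_errors = 0
    for page in itertools.count(1):
        url = url_template % page
        if download(url) is None:
            num_errors += 1
            if num_errors == max_errors:
                break
        else:
            num_errors = 0  # a success resets the consecutive-error count
            found.append(url)
    return found


def fake_download(url):
    """Stand-in for download(): pretend that only IDs 1, 2 and 4 exist."""
    page = int(url.rsplit("-", 1)[1])
    return "<html>page %d</html>" % page if page in (1, 2, 4) else None


pages = crawl_ids(fake_download, "http://example.webscraping.com/view/-%d",
                  max_errors=3)
print(pages)
```

Here ID 3 is missing, but the crawl continues to ID 4 because a single failure does not reach `max_errors`. It stops only after IDs 5, 6 and 7 fail in a row.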

0 0