Python Web Crawler (1): Fetching Web Pages
Analyzing a Website
- Identifying the technology a site uses: the builtwith module
Install:

```
pip install builtwith
```

Usage:

```python
>>> import builtwith
>>> builtwith.parse("http://127.0.0.1:8000/examples/default/index")
{u'javascript-frameworks': [u'jQuery'], u'font-scripts': [u'Font Awesome'], u'web-frameworks': [u'Web2py'], u'programming-languages': [u'Python']}
```
- Finding the site owner
Install:

```
pip install python-whois
```

Usage:

```python
>>> import whois
>>> print whois.whois("appspot.com")
{
  "updated_date": ["2017-02-06 00:00:00", "2017-02-06 02:26:49"],
  "status": [
    "clientDeleteProhibited https://icann.org/epp#clientDeleteProhibited",
    "clientTransferProhibited https://icann.org/epp#clientTransferProhibited",
    "clientUpdateProhibited https://icann.org/epp#clientUpdateProhibited",
    "serverDeleteProhibited https://icann.org/epp#serverDeleteProhibited",
    "serverTransferProhibited https://icann.org/epp#serverTransferProhibited",
    "serverUpdateProhibited https://icann.org/epp#serverUpdateProhibited",
    "clientUpdateProhibited (https://www.icann.org/epp#clientUpdateProhibited)",
    "clientTransferProhibited (https://www.icann.org/epp#clientTransferProhibited)",
    "clientDeleteProhibited (https://www.icann.org/epp#clientDeleteProhibited)",
    "serverUpdateProhibited (https://www.icann.org/epp#serverUpdateProhibited)",
    "serverTransferProhibited (https://www.icann.org/epp#serverTransferProhibited)",
    "serverDeleteProhibited (https://www.icann.org/epp#serverDeleteProhibited)"
  ],
  "name": "DNS Admin",
  "dnssec": "unsigned",
  "city": "Mountain View",
  "expiration_date": ["2018-03-10 00:00:00", "2018-03-09 00:00:00"],
  "zipcode": "94043",
  "domain_name": ["APPSPOT.COM", "appspot.com"],
  "country": "US",
  "whois_server": "whois.markmonitor.com",
  "state": "CA",
  "registrar": "MarkMonitor, Inc.",
  "referral_url": "http://www.markmonitor.com",
  "address": "2400 E. Bayshore Pkwy",
  "name_servers": [
    "NS1.GOOGLE.COM", "NS2.GOOGLE.COM", "NS3.GOOGLE.COM", "NS4.GOOGLE.COM",
    "ns1.google.com", "ns4.google.com", "ns2.google.com", "ns3.google.com"
  ],
  "org": "Google Inc.",
  "creation_date": ["2005-03-10 00:00:00", "2005-03-09 18:27:55"],
  "emails": ["abusecomplaints@markmonitor.com", "dns-admin@google.com"]
}
```

As you can see, the domain belongs to Google.
Writing Your First Crawler
Downloading a Web Page
To scrape a web page, we first need to download it. The examples below use Python's urllib2 module to download a URL.
1. Basic version
```python
import urllib2

def download(url):
    print 'Downloading:', url
    try:
        html = urllib2.urlopen(url).read()
    except urllib2.URLError as e:
        print 'Downloading error:', e.reason
        html = None
    return html
```
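A minimal usage sketch (http://example.com is just a placeholder URL):

```python
if __name__ == '__main__':
    html = download('http://example.com')
    if html is not None:
        # print how much data was fetched
        print 'Fetched %d bytes' % len(html)
```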
2. Retrying downloads
While crawling, the remote server may return a server-side error such as 500. Because these errors lie with the server rather than with our request, and are often temporary, it makes sense to retry the download.
Example:
```python
import urllib2

def download(url, num_retries=3):
    print 'Downloading:', url
    try:
        html = urllib2.urlopen(url).read()
    except urllib2.URLError as e:
        print 'Downloading error:', e.reason
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                return download(url, num_retries - 1)
    return html
```
Let's try downloading http://httpstat.us/500, a page that always returns a 500 error:
```python
if __name__ == '__main__':
    download("http://httpstat.us/500")
```
Output:
```
Downloading: http://httpstat.us/500
Downloading error: Internal Server Error
Downloading: http://httpstat.us/500
Downloading error: Internal Server Error
Downloading: http://httpstat.us/500
Downloading error: Internal Server Error
Downloading: http://httpstat.us/500
Downloading error: Internal Server Error
```
As you can see, the download is attempted four times in total (the original request plus three retries) before giving up, so the retry logic works as intended.
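Note that the retry only triggers for 5xx status codes. A client error such as 404 means the request itself is wrong, so resending it would not help; in that case download() gives up after a single attempt. A quick sketch to verify this, again using httpstat.us (which returns whatever status code you put in the path):

```python
if __name__ == '__main__':
    # 404 fails the 500 <= e.code < 600 check, so no retries happen
    download("http://httpstat.us/404")
```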
3. Setting a user agent
By default, Python identifies itself to websites with the user agent Python-urllib/2.7, where 2.7 is the Python version number. Some sites refuse requests from this default agent, so to access them normally we need to set our own user agent.
```python
import urllib2

def download(url, num_retries=3, user_agent="wswp"):
    print 'Downloading:', url
    headers = {'User-agent': user_agent}
    request = urllib2.Request(url, headers=headers)
    try:
        html = urllib2.urlopen(request).read()
    except urllib2.URLError as e:
        print 'Downloading error:', e.reason
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                return download(url, num_retries - 1, user_agent)
    return html
```
Using a custom user agent:
```python
if __name__ == '__main__':
    user_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
    # pass user_agent by keyword so it is not mistaken for num_retries
    html = download("http://www.meetup.com", user_agent=user_agent)
    print html
```
Link Crawler
A link crawler can crawl every link on a site, but usually we are only interested in a subset of them, so we can use a regular expression to filter the links we follow. The code:
```python
import urllib2
import re
import urlparse

def download(url, num_retries=3, user_agent="wswp"):
    print 'Downloading:', url
    headers = {'User-agent': user_agent}
    request = urllib2.Request(url, headers=headers)
    try:
        html = urllib2.urlopen(request).read()
    except urllib2.URLError as e:
        print 'Downloading error:', e.reason
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                return download(url, num_retries - 1, user_agent)
    return html

def link_crawler(seed_url, link_regex):
    crawl_queue = [seed_url]
    seen = set(crawl_queue)
    while crawl_queue:
        url = crawl_queue.pop()
        html = download(url)
        for link in get_links(html):
            if re.match(link_regex, link):
                print link
                link = urlparse.urljoin(seed_url, link)
                # check if we have already seen this link
                if link not in seen:
                    seen.add(link)
                    crawl_queue.append(link)

def get_links(html):
    webpage_regex = re.compile('<a[^>]+href=["\'](.*?)["\']', re.IGNORECASE)
    return webpage_regex.findall(html)

if __name__ == '__main__':
    link_crawler('http://baozoumanhua.com/video_channels/1745', '/(videos)')
```
Advanced Features
- Proxy support
Some websites block visitors from certain countries or regions, so occasionally we need to access them through a proxy. Here is how to add proxy support with urllib2:
```python
proxy = ...
opener = urllib2.build_opener()
proxy_params = {urlparse.urlparse(url).scheme: proxy}
opener.add_handler(urllib2.ProxyHandler(proxy_params))
response = opener.open(request)
```
Integrating this into our download function:
```python
def download(url, num_retries=3, user_agent="wswp", proxy=None):
    print 'Downloading:', url
    headers = {'User-agent': user_agent}
    request = urllib2.Request(url, headers=headers)
    # add proxy support
    opener = urllib2.build_opener()
    if proxy:
        proxy_params = {urlparse.urlparse(url).scheme: proxy}
        opener.add_handler(urllib2.ProxyHandler(proxy_params))
    try:
        # open via the opener so the proxy handler is actually applied
        html = opener.open(request).read()
    except urllib2.URLError as e:
        print 'Downloading error:', e.reason
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                return download(url, num_retries - 1, user_agent, proxy)
    return html
```
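A hypothetical call, assuming a local HTTP proxy is listening on port 8080 (the address is a placeholder, not a working proxy):

```python
# route the request through a (hypothetical) local proxy
html = download('http://www.meetup.com', proxy='127.0.0.1:8080')
```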
- Throttling downloads
When crawling, downloading too quickly risks getting our crawler banned or overloading the target server. To simulate normal user access and avoid these risks, we can add a delay between two consecutive downloads and thereby throttle the crawler. An implementation:
```python
import time
import datetime
import urlparse

class Throttle:
    """Add a delay between downloads to the same domain"""
    def __init__(self, delay):
        # amount of delay between downloads for each domain
        self.delay = delay
        # timestamp of when each domain was last accessed
        self.domains = {}

    def wait(self, url):
        domain = urlparse.urlparse(url).netloc
        last_accessed = self.domains.get(domain)
        if self.delay > 0 and last_accessed is not None:
            sleep_secs = self.delay - (datetime.datetime.now() - last_accessed).seconds
            if sleep_secs > 0:
                # sleep until the delay since the last access has elapsed
                time.sleep(sleep_secs)
        # record the time of this access to the domain
        self.domains[domain] = datetime.datetime.now()
```
The Throttle class records the last time each domain was accessed; if the time since that access is shorter than the specified delay, it sleeps. By calling a Throttle object before each download, we throttle the crawler. Integrated with the earlier download code:
```python
# before each download
throttle = Throttle(delay)
...
throttle.wait(url)
result = download(url, num_retries=num_retries, user_agent=user_agent, proxy=proxy)
```
- Avoiding spider traps
A spider trap arises because our crawler follows every link it finds: if the current page links to a next page, which links to yet another, and so on without end (a dynamically generated calendar with an endless "next month" link is a classic example), the crawler can keep going forever. This situation is called a spider trap.
A simple way to avoid this is to record how many links were followed to reach the current page, that is, its depth. Once the maximum depth is reached, the crawler stops adding that page's links to the queue. Adding this feature to the earlier link-following code:
```python
import urllib2
import re
import urlparse

# adds a limit on how deep pages are crawled
def download(url, num_retries=3, user_agent="wswp"):
    print 'Downloading:', url
    headers = {'User-agent': user_agent}
    request = urllib2.Request(url, headers=headers)
    try:
        html = urllib2.urlopen(request).read()
    except urllib2.URLError as e:
        print 'Downloading error:', e.reason
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                return download(url, num_retries - 1, user_agent)
    return html

def link_crawler(seed_url, link_regex, max_depth=2):
    crawl_queue = [seed_url]
    # seen is now a dict so it can record each page's depth
    seen = {seed_url: 0}
    while crawl_queue:
        url = crawl_queue.pop()
        html = download(url)
        # look up this page's depth and check whether the maximum is reached
        depth = seen[url]
        if depth != max_depth:
            for link in get_links(html):
                if re.match(link_regex, link):
                    print link
                    link = urlparse.urljoin(seed_url, link)
                    # check if we have already seen this link
                    if link not in seen:
                        seen[link] = depth + 1
                        crawl_queue.append(link)

def get_links(html):
    webpage_regex = re.compile('<a[^>]+href=["\'](.*?)["\']', re.IGNORECASE)
    return webpage_regex.findall(html)

if __name__ == '__main__':
    link_crawler('http://baozoumanhua.com/video_channels/1745', '/(videos)')
```
Of course, if you want to disable this check, simply set max_depth to a negative number: depth != max_depth will then always be true, as shown below.
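For example:

```python
# max_depth=-1 disables the depth check, since depth != -1 is always true
link_crawler('http://baozoumanhua.com/video_channels/1745', '/(videos)', max_depth=-1)
```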
- Final version
```python
import re
import urlparse
import urllib2
import time
from datetime import datetime
import robotparser
import Queue


def link_crawler(seed_url, link_regex=None, delay=5, max_depth=-1, max_urls=-1,
                 headers=None, user_agent='wswp', proxy=None, num_retries=1):
    """Crawl from the given seed URL following links matched by link_regex"""
    # the queue of URL's that still need to be crawled
    crawl_queue = Queue.deque([seed_url])
    # the URL's that have been seen and at what depth
    seen = {seed_url: 0}
    # track how many URL's have been downloaded
    num_urls = 0
    rp = get_robots(seed_url)
    throttle = Throttle(delay)
    headers = headers or {}
    if user_agent:
        headers['User-agent'] = user_agent

    while crawl_queue:
        url = crawl_queue.pop()
        # check url passes robots.txt restrictions
        if rp.can_fetch(user_agent, url):
            throttle.wait(url)
            html = download(url, headers, proxy=proxy, num_retries=num_retries)
            links = []

            depth = seen[url]
            if depth != max_depth:
                # can still crawl further
                if link_regex:
                    # filter for links matching our regular expression
                    links.extend(link for link in get_links(html)
                                 if re.match(link_regex, link))

                for link in links:
                    link = normalize(seed_url, link)
                    # check whether already crawled this link
                    if link not in seen:
                        seen[link] = depth + 1
                        # check link is within same domain
                        if same_domain(seed_url, link):
                            # success! add this new link to the queue
                            crawl_queue.append(link)

            # check whether have reached downloaded maximum
            num_urls += 1
            if num_urls == max_urls:
                break
        else:
            print 'Blocked by robots.txt:', url


class Throttle:
    """Throttle downloading by sleeping between requests to the same domain"""
    def __init__(self, delay):
        # amount of delay between downloads for each domain
        self.delay = delay
        # timestamp of when a domain was last accessed
        self.domains = {}

    def wait(self, url):
        domain = urlparse.urlparse(url).netloc
        last_accessed = self.domains.get(domain)
        if self.delay > 0 and last_accessed is not None:
            sleep_secs = self.delay - (datetime.now() - last_accessed).seconds
            if sleep_secs > 0:
                time.sleep(sleep_secs)
        self.domains[domain] = datetime.now()


def download(url, headers, proxy, num_retries, data=None):
    print 'Downloading:', url
    request = urllib2.Request(url, data, headers)
    opener = urllib2.build_opener()
    if proxy:
        proxy_params = {urlparse.urlparse(url).scheme: proxy}
        opener.add_handler(urllib2.ProxyHandler(proxy_params))
    try:
        response = opener.open(request)
        html = response.read()
        code = response.code
    except urllib2.URLError as e:
        print 'Download error:', e.reason
        html = ''
        if hasattr(e, 'code'):
            code = e.code
            if num_retries > 0 and 500 <= code < 600:
                # retry 5XX HTTP errors
                return download(url, headers, proxy, num_retries - 1, data)
        else:
            code = None
    return html


def normalize(seed_url, link):
    """Normalize this URL by removing the hash and adding the domain"""
    link, _ = urlparse.urldefrag(link)  # remove hash to avoid duplicates
    return urlparse.urljoin(seed_url, link)


def same_domain(url1, url2):
    """Return True if both URL's belong to the same domain"""
    return urlparse.urlparse(url1).netloc == urlparse.urlparse(url2).netloc


def get_robots(url):
    """Initialize the robots parser for this domain"""
    rp = robotparser.RobotFileParser()
    rp.set_url(urlparse.urljoin(url, '/robots.txt'))
    rp.read()
    return rp


def get_links(html):
    """Return a list of links from html"""
    # a regular expression to extract all links from the webpage
    webpage_regex = re.compile('<a[^>]+href=["\'](.*?)["\']', re.IGNORECASE)
    # list of all links from the webpage
    return webpage_regex.findall(html)


if __name__ == '__main__':
    user_agent = ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/58.0.3029.110 Safari/537.36")
    link_crawler('http://baozoumanhua.com/video_channels/1745', '/(videos)',
                 delay=0, num_retries=1, user_agent=user_agent)
```
That is the version with all of the features above integrated. You can now try the crawler out by running it from a terminal: python xxx.py (where xxx.py is the file you saved the code in).
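One thing the final version adds that none of the earlier examples had is robots.txt support, via the standard-library robotparser module. A minimal standalone sketch of how that check behaves (example.com and the path are placeholders):

```python
import robotparser

rp = robotparser.RobotFileParser()
rp.set_url('http://example.com/robots.txt')
rp.read()
# prints False if the site's robots.txt disallows this agent from fetching the URL
print rp.can_fetch('wswp', 'http://example.com/some/page')
```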