爬虫程序(一)---读取网页

来源:互联网 发布:yy是什么软件 编辑:程序博客网 时间:2024/06/05 03:02

读取网页时,如果访问速度过快,服务器可能会重置连接或超时,抛出异常(Windows 错误码 10054,即连接被远程主机重置),因此要在此处做 try 捕获,并在失败时启用下一个代理(代理地址可以百度"http 代理"获取,要带端口号)。同时模拟浏览器请求头,可以避免一些服务器返回错误。

 

#读取网页函数
def FormatHTML( url ):    flag = True    count = 0    sleep_download_time = 0    time_out = 10    fails = 0    HTTP_num = 0    HTTP_dl = ['211.142.236.132:80', '118.186.9.21:80', '118.186.9.22:80', '211.142.236.132:80']    while True:        if fails >= 3:            return None            break        try:            print u'========开启代理========='            opener = urllib2.build_opener( urllib2.ProxyHandler( {'http':HTTP_dl[HTTP_num]} ), urllib2.HTTPHandler( debuglevel = 1 ) )            urllib2.install_opener( opener )            while flag:                try:                    print u'=========模拟浏览器========='                    i_headers = {"User-Agent": "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9.1) Gecko/20090624 Firefox/3.5", "Referer": 'http://www.baidu.com'}                    req = urllib2.Request( url, headers = i_headers )                    time.sleep( sleep_download_time )                    print u'==========读取网页==========='                    f = urllib2.urlopen( req, timeout = time_out )                    flag = False                except urllib2.HTTPError, e:                      if e.code == 404:                        print 'e.code:' + str( e.code )                         count += 1                        print 'count=' + str( count )                        if count >= 4:                            print 'count==' + str( count )                            flag = False                            return None                     else:                        sleep_download_time = sleep_download_time + 2                        time.sleep( sleep_download_time )                        count += 1                        print 'urllib2.HTTPError:' + str( e.code )                        s = sys.exc_info()                        print s                        print "Error '%s' happened on line %d" % ( s[1], s[2].tb_lineno )                        if count == 10:                            flag = False                            return None 
               except :                    print 2                    sleep_download_time = sleep_download_time + 2                    time.sleep( sleep_download_time )                    count += 1                    print url                    print u"连接超时!"                    s = sys.exc_info()                    print "Error '%s' happened on line %d" % ( s[1], s[2].tb_lineno )                    if count == 10:                        flag = False                        return None                      reader = BeautifulSoup( f.read() )             print u'==========读取完毕==========='             f.close()            break        except():            HTTP_num += 1            s = sys.exc_info()            print "Error '%s' happened on line %d" % ( s[1], s[2].tb_lineno )            fails += 1            time_out += 5    return reader