[python] My First Web Crawler



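Data source

[ 糗百 (Qiushibaike) ] http://www.qiushibaike.com/hot/page/2

Open the Qiushibaike home page and view its HTML source.

[Screenshot of the data source]

Code

Scraping the author names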

#coding=utf-8
import urllib
import urllib2
import re

page = 2
url = 'http://www.qiushibaike.com/hot/page/' + str(page)
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
headers = { 'User-Agent' : user_agent }
try:
    request = urllib2.Request(url,headers = headers)
    response = urllib2.urlopen(request)
    content = response.read().decode('utf-8')
    pattern = re.compile('<div.*?author.*?<a.*?</a>.*?<a.*?title="(.*?)">.*?<h2>(.*?)</h2>.*?</a>.*?</div>',re.S)
    items = re.findall(pattern,content)
    for item in items:
        print item[0]
except urllib2.URLError, e:
    if hasattr(e,"code"):
        print e.code
    if hasattr(e,"reason"):
        print e.reason
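To see what the regular expression in the script captures without hitting the network, it can be run against a small HTML fragment. The sketch below is written for Python 3 and uses made-up markup: `SAMPLE_HTML`, the user names `alice` and `bob`, and the helper `extract_authors` are all assumptions for illustration, not the site's real page source. Only the pattern itself comes from the script.

```python
import re

# The same pattern as in the script: group 1 is the title attribute of the
# second <a> inside an "author" div, group 2 is the text of the <h2>.
AUTHOR_PATTERN = re.compile(
    r'<div.*?author.*?<a.*?</a>.*?<a.*?title="(.*?)">.*?<h2>(.*?)</h2>.*?</a>.*?</div>',
    re.S,  # let "." also match newlines, since each block spans several lines
)

# Hypothetical markup, invented for this example. The real page may differ.
SAMPLE_HTML = '''
<div class="author clearfix">
<a href="/users/1/"><img src="a.jpg" alt="alice"></a>
<a href="/users/1/" title="alice">
<h2>alice</h2>
</a>
</div>
<div class="content">some text</div>
<div class="author clearfix">
<a href="/users/2/"><img src="b.jpg" alt="bob"></a>
<a href="/users/2/" title="bob">
<h2>bob</h2>
</a>
</div>
'''


def extract_authors(html):
    """Return the captured title attribute (item[0] in the script) per match."""
    return [title for title, _heading in AUTHOR_PATTERN.findall(html)]


authors = extract_authors(SAMPLE_HTML)
for name in authors:
    print(name)
```

Because every quantifier is non-greedy (`.*?`), each match stops at the nearest closing `</div>`, so one author block yields one tuple. A pattern like this depends on the exact tag order in the page and breaks if the markup changes.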

Result

Python 2.7.2 |EPD_free 7.2-2 (32-bit)| (default, Sep 14 2011, 11:02:05) [MSC v.1500 32 bit (Intel)] on win32
Type "copyright", "credits" or "license()" for more information.
>>> ================================ RESTART ================================
>>> 
挖鼻孔的老虎loser...........Dan喵@胖妞向阳河单名一个饭字欲湖冰心王爷有人陌路莫回。媚娘向阳河♂ART⌒oOㄣ季向晚哥哥这个冬天冻成狗哈哈大好时光王冰痕壞壊_智商都用来卖萌啦二女子、(驹迷)超越_后来!
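The session above ran under Python 2.7.2. In Python 3 the `urllib2` module no longer exists; its functionality lives in `urllib.request` and `urllib.error`. A minimal sketch of the fetching step for Python 3 might look like the following. The function names `build_request` and `fetch_page` and the `timeout` argument are assumptions added here, not part of the original script, and whether the site still serves this URL in the same form is not something the sketch can guarantee.

```python
import urllib.error
import urllib.request

BASE_URL = 'http://www.qiushibaike.com/hot/page/'
USER_AGENT = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'


def build_request(page):
    """Build the request with the same URL scheme and User-Agent as the script."""
    return urllib.request.Request(
        BASE_URL + str(page),
        headers={'User-Agent': USER_AGENT},
    )


def fetch_page(page, timeout=10):
    """Return the decoded page body, or None if the request fails.

    The timeout is an addition for this sketch; the original script sets none.
    """
    try:
        with urllib.request.urlopen(build_request(page), timeout=timeout) as response:
            return response.read().decode('utf-8')
    except urllib.error.URLError as e:
        # HTTPError is a subclass of URLError and carries a status code.
        if hasattr(e, 'code'):
            print(e.code)
        if hasattr(e, 'reason'):
            print(e.reason)
        return None
```

Calling `fetch_page(2)` would request the same page as the script, and the returned text can then be passed to `re.findall` with the pattern shown earlier.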