Python Web Scraping Basics (Using HTMLParser)

Source: Internet | Published by: 七月算法邹博 | Edited by: 程序博客网 | Time: 2024/06/06 02:54

This program scrapes the popular articles listed under the 今日神段 tag, using HTMLParser and regular expressions.

from html.parser import HTMLParser
import re
import urllib.request


class Scraper(HTMLParser):
    def handle_starttag(self, tag, attrs):
        # Only <a> tags that carry a title attribute are treated as article links
        if tag == 'a':
            attrs = dict(attrs)
            if "title" in attrs:
                try:
                    page = {}
                    page["link"] = attrs["href"]
                    page["target"] = attrs["target"]
                    page["title"] = attrs["title"]
                    #page["article-id"] = attrs["article-id"]
                    message.append(page)
                except KeyError:
                    # href or target is missing on this tag
                    print("Caught an error")
                    print(attrs)


message = []
url = "http://pinyin.sogou.com/zimeiti/tag/%E4%BB%8A%E6%97%A5%E7%A5%9E%E6%AE%B5"
webpage = urllib.request.urlopen(url).read().decode()
parser = Scraper()
parser.feed(webpage)
parser.close()

while True:
    # Print the article list as a numbered menu
    for index, each in enumerate(message, start=1):
        print("page:%2d title:%s" % (index, each['title']))
    num = int(input("Enter the number of the article to read: "))
    nextpage = urllib.request.urlopen("http://pinyin.sogou.com" + message[num-1]["link"]).read().decode()
    # Note: the trailing [<br|</span>] is a character class, not an alternation
    pat = '">(&nbsp; )?([^a-z\nA-Z<&]*?)(&nbsp)?[<br|</span>]'
    date = re.findall(pat, nextpage)
    for ts in date:
        if ts[1] == '0':
            break  # skip the advertisement
        if ts[1] != '':
            print(ts[1])
    print('')
    print('')
    input("Press Enter to return to the article list")

In the data retrieved, pages 1-3 cannot be opened, so selecting any of them raises an error.
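One way to keep the menu loop running when an article cannot be opened is to wrap the fetch in a small helper. The sketch below is only an illustration: `fetch_text` and the two stand-in openers are hypothetical names, and it assumes the failure surfaces as a `URLError` or `ValueError`, which the original post does not confirm.

```python
import io
import urllib.error
import urllib.request


def fetch_text(url, opener=urllib.request.urlopen):
    """Return the decoded page body, or None if the page cannot be fetched."""
    try:
        with opener(url) as response:
            return response.read().decode()
    except (urllib.error.URLError, ValueError) as exc:
        # URLError covers network and HTTP failures; ValueError covers
        # malformed URLs and UnicodeDecodeError (a ValueError subclass).
        print("Could not open %s: %s" % (url, exc))
        return None


# Stand-in openers (hypothetical) so the sketch runs without network access
def broken_opener(url):
    raise urllib.error.URLError("simulated failure")


def working_opener(url):
    return io.BytesIO("正文".encode())


print(fetch_text("http://example.invalid/a", opener=broken_opener))
print(fetch_text("http://example.invalid/b", opener=working_opener))
```

In the main loop, a `None` result could simply trigger a `continue` back to the article list instead of a traceback.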

Basic usage of HTMLParser:

This program uses only handle_starttag(self, tag, attrs).

The parser calls this method each time it encounters a start tag; tag is the name of that tag, converted to lowercase.

In this program the tag of interest is 'a', that is, elements that begin with <a.

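attrs holds the attributes of that tag as a list of (name, value) pairs. dict(attrs) converts the list into a dictionary, after which each attribute can be looked up by its key.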
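The behaviour described above can be tried on a small string without touching the network. This is a minimal sketch: the `LinkCollector` class and the sample markup are made up for illustration and are not taken from the real site.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects link/title pairs from <a> tags that carry a title attribute."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) tuples
        if tag != "a":
            return
        attrs = dict(attrs)
        if "title" in attrs:
            self.links.append({"link": attrs.get("href"), "title": attrs["title"]})


# Hypothetical sample markup
sample = (
    '<ul>'
    '<li><a href="/zimeiti/1" title="First post" target="_blank">one</a></li>'
    '<li><a href="/about">no title here</a></li>'
    '<li><A HREF="/zimeiti/2" TITLE="Second post">two</A></li>'
    '</ul>'
)

collector = LinkCollector()
collector.feed(sample)
collector.close()

for item in collector.links:
    print(item["title"], item["link"])
```

Storing the results on the parser instance rather than in a module-level list keeps the class reusable. The third link also shows that tag and attribute names reach the handler in lowercase, whatever their case in the markup.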

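Once this basic information has been collected, the program follows the link to the chosen article. The structure of the article page is more complicated, so a regular expression is used here to extract the article body.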
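The extraction step can be illustrated on a made-up fragment. Note that the pattern in the program ends with `[<br|</span>]`, which Python reads as a character class. The sketch below is a variation, not the author's exact pattern: it uses a non-capturing alternation group instead, and the sample markup is hypothetical.

```python
import re

# Hypothetical article fragment
sample = (
    '<p><span style="font-size:16px">今天天气不错。</span></p>'
    '<p><span style="font-size:16px">&nbsp; 第二段文字！<br/></span></p>'
)

# After a closing quote and '>', optionally skip "&nbsp; ", then lazily capture
# text that contains no ASCII letters, newlines, '<' or '&', stopping at
# "<br" or "</span>".
pattern = re.compile(r'">(?:&nbsp; )?([^a-zA-Z\n<&]*?)(?:&nbsp;)?(?:<br|</span>)')

# Drop empty captures, as the program does with its ts[1] != '' check
paragraphs = [text for text in pattern.findall(sample) if text]

for text in paragraphs:
    print(text)
```

Excluding ASCII letters from the captured text is what keeps markup and script fragments out of the output. It only works because the article body is Chinese text, so the approach would not carry over to pages written in English.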
