A Beginner's Python 3 Web Crawler

Source: Internet | Published by: js获取input | Editor: 程序博客网 | Time: 2024/05/20 15:38

Python web crawler, Ver 1.0 alpha

The crawler's entry point is feng's Startup News (http://news.dbanotes.net/).

The program is as follows:

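import re
import urllib.request
from collections import deque

queue = deque()  # empty queue
visited = set()  # empty set, stores the URLs already visited

url = 'http://news.dbanotes.net/news'  # entry page; can be swapped for another
queue.append(url)  # put the entry URL into the queue
cnt = 0

# Loop until the queue is empty
while queue:
    url = queue.popleft()  # dequeue the first element
    visited |= {url}  # mark as visited
    print('Fetched so far: ' + str(cnt) + '  now fetching <--- ' + url)
    cnt += 1

    # Use try..except so a failed request or decode does not abort the program
    try:
        urlop = urllib.request.urlopen(url)
        # if 'html' not in urlop.gentheader('Content-Type'):
        #     continue
        data = urlop.read()
        data = data.decode('utf-8')
    except Exception:
        continue

    # Use a regular expression to extract every link on the page, check whether
    # it has already been visited, and then add it to the crawl queue
    linkre = re.compile('href="(.+?)"')
    for x in linkre.findall(data):
        if 'http' in x and x not in visited:
            queue.append(x)
            print('Added to queue ---> ' + x)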
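The queue-and-visited-set loop above can be tried without touching the network. The following is a minimal sketch, not the original author's method: it swaps the regular expression for the standard library's `html.parser` and resolves relative links with `urllib.parse.urljoin`. The names `LinkParser`, `extract_links`, `crawl`, and `fake_site`, as well as the `example.com` pages, are assumptions made up for illustration. URLs are marked as seen when they are queued rather than when they are dequeued, so the same address is never queued twice.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)


def extract_links(html, base_url):
    """Return absolute URLs for all links found in `html`."""
    parser = LinkParser()
    parser.feed(html)
    # urljoin resolves relative links such as "/news" against the page URL
    return [urljoin(base_url, link) for link in parser.links]


def crawl(start_url, fetch, max_pages=10):
    """Breadth-first crawl; `fetch` is any function mapping a URL to HTML text."""
    queue = deque([start_url])
    seen = {start_url}  # marked when queued, so no URL is queued twice
    order = []          # pages fetched successfully, in crawl order
    while queue and len(order) < max_pages:
        url = queue.popleft()
        try:
            html = fetch(url)
        except Exception:
            continue  # skip pages that cannot be fetched
        order.append(url)
        for link in extract_links(html, url):
            if link.startswith('http') and link not in seen:
                seen.add(link)
                queue.append(link)
    return order


# Hypothetical in-memory "site" used in place of real network requests
fake_site = {
    'http://example.com/': '<a href="/a">A</a> <a href="/b">B</a>',
    'http://example.com/a': '<a href="/b">B</a> <a href="/">home</a>',
    'http://example.com/b': "<a href='http://example.com/a'>A</a>",
}

print(crawl('http://example.com/', fake_site.__getitem__))
```

To crawl real pages, `fetch` could be replaced by a function that calls `urllib.request.urlopen(url).read().decode('utf-8')`; `max_pages` keeps the run bounded.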

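The program's output is shown in the figure. I don't know whether it is correct, but it did crawl something, so I won't think about it too hard; as a beginner I wouldn't understand it anyway. Two lines in the program are commented out because I couldn't get them to run, so all I could do was comment them out.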
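The two commented-out lines look like a check that skips responses that are not HTML. One likely reason they failed is the method name: the code spells it `gentheader`, whereas `http.client.HTTPResponse` (the object `urllib.request.urlopen` returns for HTTP URLs) provides `getheader`. Below is a minimal sketch of such a check; `is_html` is a hypothetical helper name, not part of the original program, and the set of accepted media types is an assumption.

```python
def is_html(content_type):
    """Return True if a Content-Type header value denotes an HTML page."""
    # getheader returns None when the header is missing, so guard against that
    if not content_type:
        return False
    # Drop parameters such as "; charset=utf-8" before comparing
    media_type = content_type.split(';', 1)[0].strip().lower()
    return media_type in ('text/html', 'application/xhtml+xml')


# Possible use inside the crawl loop (assumes `urlop` came from urllib.request.urlopen):
#     if not is_html(urlop.getheader('Content-Type')):
#         continue

print(is_html('text/html; charset=utf-8'))  # True
print(is_html('image/png'))                 # False
print(is_html(None))                        # False
```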

That's all for today, hehe.

Reposted from https://jecvay.com/2014/09/python3-web-bug-series2.html
