Scraping a Novel with Python (诛仙)

Source: Internet | Editor: 程序博客网 | Time: 2024/04/29 23:47

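I'm still learning web scraping, and just reading about it without writing any code didn't feel like enough, so I tried scraping a novel for practice.

Right now I'm still very shaky with regular expressions. I mainly need them to match and strip tags such as div and br before saving the text to a txt file, and I kept looking things up the whole way through...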
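As a small illustration of the tag-stripping step described above, here is a minimal sketch that uses regular expressions only. It assumes Python 3, and the function name `strip_tags` and the sample HTML string are made up for this example; they are not taken from the site being scraped.

```python
import re

def strip_tags(html):
    """Turn a chapter's HTML fragment into plain text using regexes only."""
    # Drop <script>...</script> blocks together with their contents.
    text = re.sub(r'<\s*script[^>]*>.*?<\s*/\s*script\s*>', '', html,
                  flags=re.S | re.I)
    # Turn <br>, <br/> and <br /> into newlines.
    text = re.sub(r'<\s*br\s*/?\s*>', '\n', text, flags=re.I)
    # Remove opening and closing div tags but keep the text between them.
    text = re.sub(r'</?\s*div\b[^>]*>', '', text, flags=re.I)
    return text.strip()

# Hypothetical fragment shaped like a chapter's content block.
sample = '<div id="content">First line<br/>Second line<br /><script>ad();</script></div>'
print(strip_tags(sample))
```

Regexes like these are fine for a page with a simple, known layout, but they can break on nested or unusual markup, so treat this as a practice aid rather than a general HTML cleaner.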

# coding: utf-8
# Python 2 script (uses urllib2 and print statements).
from bs4 import BeautifulSoup
import urllib2
import re

title = []    # chapter titles
href = []     # chapter links
url = 'http://www.biquge.tw/26_26491/'
response = urllib2.urlopen(url)
html_cont = response.read()
soup = BeautifulSoup(html_cont, 'html.parser', from_encoding='utf-8')
hrefAndname = soup.find("div", {"id": "list"}).findAll("a")

# Keep only the links whose text looks like a chapter heading (第...章).
chapter_re = re.compile(ur'\u7b2c.+\u7ae0')
for item in hrefAndname:
    if chapter_re.search(item.text):
        title.append(item.text)
        href.append(item['href'])

for i in range(len(href)):
    try:
        print "Fetching chapter " + str(i + 1) + " ..."
        newurl = 'http://www.biquge.tw' + href[i]
        response = urllib2.urlopen(newurl)
        html_cont = response.read()
        soup = BeautifulSoup(html_cont, 'html.parser', from_encoding='utf-8')
        content = soup.find("div", {"id": "content"})
        cont = str(content)
        # Drop inline script blocks.
        cont = re.sub(r'<\s*script[^>]*>[^<]*<\s*/\s*script\s*>', '', cont)
        cont = re.sub(r'</div>', '', cont)        # remove closing div tags
        cont = re.sub(r'<div\s\S*>', '', cont)    # remove opening div tags
        cont = re.sub(r'<br/>', '\n', cont)       # turn br tags into newlines
        # One file per chapter, named after the chapter title.
        # (Alternative: name the files by index, str(i + 1) + '.txt'.)
        with open("E:/res/" + title[i] + '.txt', 'w') as f:
            f.write(cont)
        print "success"
    except Exception as e:
        print "Sorry, failed to fetch chapter " + str(i + 1) + ": " + repr(e)
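The chapter filter in the script relies on the pattern `\u7b2c.+\u7ae0`, that is, 第 followed by at least one character and then 章. The sketch below shows how that pattern behaves on a few links. It assumes Python 3; the helper name `pick_chapters`, the sample titles, and the sample paths are invented for illustration only.

```python
import re

# 第 + at least one character + 章, roughly "Chapter N" in Chinese.
CHAPTER_RE = re.compile('\u7b2c.+\u7ae0')

def pick_chapters(links):
    """Keep only (title, href) pairs whose title looks like a chapter heading."""
    return [(title, href) for title, href in links if CHAPTER_RE.search(title)]

# Made-up link list, shaped like the anchors collected from an index page.
links = [
    ('第一章 开端', '/26_26491/1.html'),
    ('最新章节', '/26_26491/latest.html'),   # contains 章 but no 第 before it
    ('第二章 相遇', '/26_26491/2.html'),
]

for title, href in pick_chapters(links):
    print(title, href)
```

The second entry is dropped because 章 alone is not enough: the pattern needs 第 earlier in the same title.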

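《诛仙》 is fairly short as novels go. Since the chapters are fetched one at a time, a longer novel would take a very long time.

Next, I plan to learn multithreading in Python to speed up the scraping.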
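One possible shape for the multithreaded version is a thread pool, sketched below under some assumptions: it is written for Python 3 and uses `concurrent.futures.ThreadPoolExecutor` from the standard library, and `fetch_all`, `fake_fetch`, and the chapter paths are hypothetical names chosen for this example. A stand-in function replaces the real download so the sketch runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_all(urls, fetch, max_workers=8):
    """Run fetch over every URL in a thread pool; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() runs the calls concurrently but yields results in the order of urls.
        return list(pool.map(fetch, urls))

def fake_fetch(url):
    # Stand-in for a real download (for example via urllib.request.urlopen).
    return 'content of ' + url

# Made-up chapter paths, only to exercise the pool.
chapter_urls = ['/26_26491/%d.html' % n for n in range(1, 6)]
pages = fetch_all(chapter_urls, fake_fetch)
print(len(pages))
```

Downloading is mostly waiting on the network, so threads can overlap that waiting even though only one thread runs Python code at a time. Keeping `max_workers` small is also kinder to the site being scraped.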

0 0