A Python crawler that scrapes up to 100 Zhihu articles

Source: Internet  Editor: 程序博客网  Date: 2024/05/19 02:17

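I have been spending a lot of time on Zhihu lately, and some of the articles and answers there are really good, so I spent an evening writing this crawler. I had never written a standalone program in Python before, so it has quite a few bugs. The code posted here does run, and it creates zhihu_jingxuan.txt in the same directory as the script; that file holds the scraped articles. The main problem is that when too many articles are scraped, Python raises a "maximum recursion depth exceeded" error. A quick search shows that Python caps the recursion depth, at 1000 by default in CPython (see sys.getrecursionlimit()). The script uses the BeautifulSoup library, and my guess is that it walks nested tags recursively when it processes them, which is what exceeds the limit. I am noting it here and will read the source when I have time.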
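To make the error concrete, here is a minimal sketch of how a recursive walk over a deeply nested structure runs into the interpreter's limit. The nested list only stands in for a deeply nested HTML tree, and `nesting_depth` and `make_nested` are hypothetical helpers written for this illustration, not BeautifulSoup code. It is written to run on Python 3, unlike the script in this post.

```python
import sys

def make_nested(depth):
    """Build [[[...]]] with a loop, so building it never recurses."""
    node = []
    for _ in range(depth - 1):
        node = [node]
    return node

def nesting_depth(node):
    """Measure the nesting recursively, one stack frame per level."""
    if not node:
        return 1
    return 1 + nesting_depth(node[0])

limit = sys.getrecursionlimit()  # 1000 by default in CPython

print(nesting_depth(make_nested(50)))  # shallow input is fine

try:
    nesting_depth(make_nested(limit * 2))  # deeper than the limit allows
except RuntimeError as exc:  # RecursionError is a RuntimeError subclass on Python 3.5+
    print("recursion limit hit:", exc)
```

Raising the limit with `sys.setrecursionlimit` is one possible workaround, but a limit set too high can crash the interpreter with a real stack overflow instead of a catchable exception.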


#coding:utf-8
# Python 2 script (print statements, urllib.urlopen, file()).
import urllib
from bs4 import BeautifulSoup
import re

url = "http://www.zhihu.com"
filename = "zhihu_jingxuan.txt"

def parseArticleFromHtml(html):
    # Turn one article page into plain text: title first, then each answer body.
    soup = BeautifulSoup(html)
    result = "<<"+soup.html.head.title.string+">>\r\n"
    for i in soup.findAll('div',{'class':'zm-editable-content'}):
        tmp = i
        if tmp is not None:
            tmp2 = str(tmp)
            # Replace every HTML tag with a line break.
            tmp3 = re.sub('<[^>]+>',"\r\n",tmp2)
            result += "*************************\r\n"
            # try:
            result += tmp3
            result +="\r\n"
            # except:
            # continue
    result +="<><><><><><><><><><>"
    # Collapse runs of blank lines (five passes).
    for ii in range(5):
        result = result.replace("\r\n\r\n","\r\n")
    return result

def parseArticleFromLink(link):
    # Download one article and append its text to the output file.
    print link
    html = urllib.urlopen(link)
    content = html.read()
    html.close()
    # try:
    article_string = parseArticleFromHtml(content)
    myfilewriter = file(filename,'a+')
    myfilewriter.write("\r\n")
    myfilewriter.write(article_string)
    myfilewriter.close()
    # except UnicodeEncodeError:
    # pass
    return

# Collect article links: home page -> topic cards -> question links.
mylist = []
html = urllib.urlopen(url)
content = html.read()
html.close()
soup = BeautifulSoup(content)
info_cards = soup.findAll('a',{'class':'rep'})
for an_info_cards in info_cards:
    print an_info_cards.span.string
    newlink = url+dict(an_info_cards.attrs)["href"]
    newhtml = urllib.urlopen(newlink)
    newcontent = newhtml.read()
    newhtml.close()
    newsoup = BeautifulSoup(newcontent)
    question_links = newsoup.findAll('a',{'class':'question_link'})
    for a_question_link in question_links:
        article_link = url+dict(a_question_link.attrs)["href"]
        # parseArticleFromLink(article_link)
        if "answer" in article_link:
            mylist.append(article_link)

print len(mylist)

# Scrape at most `counter` articles.
counter = 100
if(len(mylist)>counter):
    for item in range(counter):
        print item
        parseArticleFromLink(mylist[item])
else:
    for item in mylist:
        parseArticleFromLink(item)
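The tag-stripping step in parseArticleFromHtml can be tried out in isolation. The sketch below is a hypothetical helper, `strip_tags`, that is not part of the original script. It uses the same `<[^>]+>` pattern, but collapses blank lines with a second regular expression instead of five `replace` passes, so a run of line breaks of any length is reduced to one. It is written to run on Python 3, and the sample HTML is made up for illustration.

```python
import re

def strip_tags(fragment):
    """Replace every HTML tag with a line break, then collapse blank lines."""
    text = re.sub(r"<[^>]+>", "\r\n", fragment)
    # Collapse any run of consecutive line breaks into a single one.
    return re.sub(r"(?:\r\n)+", "\r\n", text)

# Made-up sample resembling one answer body.
sample = "<div class='zm-editable-content'><p>first</p><p>second<br/>third</p></div>"
print(repr(strip_tags(sample)))
```

A regular expression like this is only a rough way to remove markup; it assumes no `>` appears inside attribute values or text.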

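Xiang (翔神) says that when he runs it on his machine he gets a UTF-8 to Unicode conversion error. This is probably caused by a different system default encoding. It works fine on my computer.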
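One way to make the output independent of the system default encoding is to open the file with an explicit encoding and write Unicode text to it. This is a minimal sketch under the assumption that the error comes from an implicit encode or decode at write time, which the post does not confirm. `append_article` is a hypothetical replacement for the file-writing lines in parseArticleFromLink, and the temporary path is only for the demo.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

def append_article(path, text):
    """Append Unicode text to path, always encoding it as UTF-8."""
    # newline="" writes "\r\n" exactly as given on every platform.
    with io.open(path, "a", encoding="utf-8", newline="") as fh:
        fh.write(u"\r\n")
        fh.write(text)

# Demo: write to a throwaway directory and read the text back.
demo_path = os.path.join(tempfile.mkdtemp(), "zhihu_jingxuan.txt")
append_article(demo_path, u"<<知乎精选>>")
with io.open(demo_path, encoding="utf-8", newline="") as fh:
    print(repr(fh.read()))
```

`io.open` accepts the same `encoding` argument on Python 2 and Python 3, so the sketch does not depend on which interpreter or locale is used.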
