Python Web Crawler (Part 2): Scraping Zhihu Q&A


Zhihu is known for its high-quality Q&A. Only a few days into learning web scraping, I ran an experiment on crawling its Q&A content.

On the Zhihu homepage, you type a keyword to search for questions, then click a question to see the answers users have posted for it.

Mirroring that workflow, the crawl breaks into two steps:

1. Search for questions by keyword (e.g. java), which yields url=https://www.zhihu.com/search?type=content&q=java; crawl that page for every question and its question id (the id extraction is sketched right after this list);

2. With each question and id from step 1, build url=https://www.zhihu.com/question/31437847 and crawl that page for all of the users' answers.
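
For step 1, the two key details are URL-encoding the keyword and pulling the question id out of each search result's answer link. Here is a minimal standalone sketch of that extraction; the href value is an illustrative example, and the full script below does the same thing inside its result loop:

import re
from urllib import parse

keyword = 'java'
search_url = 'https://www.zhihu.com/search?type=content&q=' + parse.quote(keyword)
print(search_url)  # https://www.zhihu.com/search?type=content&q=java

# Each search result embeds a link to one specific answer; the question id
# is the first path segment after /question/.
href = '/question/31437847/answer/123456789'  # illustrative href
match = re.match(r'/question/([0-9]+)/answer/([0-9]+)$', href)
if match:
    question_url = 'https://www.zhihu.com/question/' + match.group(1)
    print(question_url)  # https://www.zhihu.com/question/31437847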


The full code follows (https://github.com/tianyunzqs/crawler/tree/master/zhihu):

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re
from urllib import request, parse
from bs4 import BeautifulSoup

keyword_list = ['svm', '支持向量机', 'libsvm']
fout = open("E:/python_file/zhihu.txt", "w", encoding="utf-8")
user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) ' \
             'Chrome/39.0.2171.95 Safari/537.36'
headers = {'User-Agent': user_agent}

for keyword in keyword_list:
    print(keyword)
    # Step 1: search for the keyword and collect question titles and URLs.
    url = 'https://www.zhihu.com/search?type=content&q=' + parse.quote(keyword)
    keyword_question_url_list = {}
    try:
        req = request.Request(url, headers=headers)
        response = request.urlopen(req, timeout=5)
        content = response.read().decode('utf-8')
        soup = BeautifulSoup(content, 'html.parser')
        # Each search result is an <li class="item clearfix ...">.
        all_div = soup.find_all('li', attrs={'class': re.compile('item clearfix.*?')})
        question_url_list = {}
        for e_div in all_div:
            # Question results: the title links to /question/<id>.
            title = e_div.find_all('a', attrs={'class': 'js-title-link',
                                               'target': '_blank',
                                               'href': re.compile('/question/[0-9]+')})
            if title:
                title = title[0].text
                # The embedded <link itemprop="url"> points at one specific
                # answer; extract the question id from its href.
                _id = e_div.find_all('link', attrs={'itemprop': 'url',
                                                    'href': re.compile('/question/[0-9]+/answer/[0-9]+')})
                href = _id[0].attrs.get('href')
                pattern = re.compile('/question/(.*?)/answer/(.*?)$', re.S)
                items = re.findall(pattern, href)
                question_id = items[0][0]
                question_url_list[title] = 'https://www.zhihu.com/question/' + question_id
            else:
                # Column (zhuanlan) results link directly to the article.
                title_id = e_div.find_all('a', attrs={'class': 'js-title-link',
                                                      'target': '_blank',
                                                      'href': re.compile('https://zhuanlan.zhihu.com/p/[0-9]+')})
                if title_id:
                    title = title_id[0].text
                    href = title_id[0].attrs.get('href')
                    question_url_list[title] = href
                else:
                    continue
        keyword_question_url_list[keyword] = question_url_list
    except Exception:
        continue

    # Step 2: visit each question page and write its answers to the file.
    for keyword, question_url_list in keyword_question_url_list.items():
        for question, url in question_url_list.items():
            fout.write(question + "\n")
            try:
                req = request.Request(url, headers=headers)
                with request.urlopen(req, timeout=5) as response:
                    content = response.read().decode('utf-8')
                    soup = BeautifulSoup(content, 'html.parser')
                    all_div = soup.find_all('div', attrs={'class': 'List-item'})
                    for e_div in all_div:
                        answer = e_div.find_all('span', attrs={'class': 'RichText CopyrightRichText-richText',
                                                               'itemprop': 'text'})
                        answer = answer[0].text
                        fout.write(answer + "\n")
            except request.URLError as e:
                if hasattr(e, "code"):
                    print(e.code)
                if hasattr(e, "reason"):
                    print(e.reason)

fout.close()


Known issue:

The program handles step 1 well, but in step 2 it only retrieves the first two answers to each question, presumably because the remaining answers are loaded dynamically by JavaScript as you scroll.

According to http://www.cnblogs.com/buzhizhitong/p/5697526.html, Selenium + PhantomJS should be able to solve this; I'll try it later.
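
I haven't tried it yet, but the idea would be to drive a real browser engine so the page's JavaScript runs and loads the remaining answers, then parse the rendered HTML with the same selectors as before. A rough, untested sketch follows; note that newer Selenium releases dropped PhantomJS support, so a headless Chrome/Firefox driver would be the substitute there, and the scroll-until-stable loop with a 2-second wait is an assumption about how Zhihu loads answers:

import time
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.PhantomJS()  # newer Selenium: use a headless Chrome/Firefox driver instead
driver.get('https://www.zhihu.com/question/31437847')

# Scroll to the bottom repeatedly so the page's JavaScript keeps loading
# more answers, until the page height stops growing.
last_height = driver.execute_script('return document.body.scrollHeight')
while True:
    driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
    time.sleep(2)  # assumed wait for the new answers to render
    new_height = driver.execute_script('return document.body.scrollHeight')
    if new_height == last_height:
        break
    last_height = new_height

# Parse the fully rendered page with the same selectors as the script above.
soup = BeautifulSoup(driver.page_source, 'html.parser')
answers = soup.find_all('span', attrs={'class': 'RichText CopyrightRichText-richText',
                                       'itemprop': 'text'})
for answer in answers:
    print(answer.text)

driver.quit()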