【scrapy】Simulating a login to Zhihu

Source: Internet  Editor: 程序博客网  Time: 2024/05/21 22:47

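There is a widely shared tutorial for this online, but I spent far too much time debugging it. After talking with people on Zhihu, it turned out that many of them got stuck at the same place. The end result: I gave up on CrawlSpider.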

Here is the link first: http://ju.outofmemory.cn/entry/105646 (use with caution).

These are the problems I ran into with that tutorial:

Problem 1: Zhihu's login URL is no longer /login. It is now split by account type into /login/phone_num and /login/email, so the URL in start_requests has to be changed.
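As a small illustration of the split described in Problem 1, the sketch below picks the endpoint from the account string. The helper name login_url, the "@" test, and the "email" form field name are assumptions made for this example; only the two paths and the phone_num field appear in this post.

```python
def login_url(account, base="https://www.zhihu.com"):
    """Return (endpoint, form_field) for an account string.

    Assumption: anything containing "@" is an email address and
    everything else is treated as a phone number.
    """
    if "@" in account:
        # "email" as the form field name is a guess, not confirmed by the post
        return base + "/login/email", "email"
    return base + "/login/phone_num", "phone_num"


endpoint, field = login_url("13800000000")
print(endpoint, field)
```

The returned field name would then be used as the key in the formdata dictionary of the login request.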

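Problem 2: Following the documentation, I simulated the login with FormRequest.from_response, but printing response.body in after_login still showed the login page. Someone else hit this too; by his explanation the login probably succeeded but the method that fetches the URLs was never called. I did not verify that. I gave up on this approach and submitted the data with FormRequest directly. Also, FormRequest.from_response seemed to send a GET; after changing it to method='POST' it returned 403. I don't know whether the method cannot be changed or whether there is some other cause.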

FormRequest itself can set method to POST. But printing response.body in after_login still showed the login page.

Problem 3: My first idea for Problem 2 was to revisit zhihu.com from after_login with the post-login cookie, through make_requests_from_url. The result was this message:

no more duplicates will be shown(see dupefilter_debug to show all duplicates)
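That message comes from Scrapy's duplicate filter, which drops a request it has already seen unless the request is created with dont_filter=True. The toy model below only illustrates the idea; the class SeenFilter and its method are invented for this sketch, and the real filter fingerprints more than the URL.

```python
class SeenFilter:
    """Toy duplicate filter keyed on the URL only (a simplification)."""

    def __init__(self):
        self.seen = set()

    def should_schedule(self, url, dont_filter=False):
        # dont_filter bypasses the check entirely
        if dont_filter:
            return True
        if url in self.seen:
            return False  # this is the case that triggers the log message
        self.seen.add(url)
        return True


f = SeenFilter()
first = f.should_schedule("https://www.zhihu.com/")
second = f.should_schedule("https://www.zhihu.com/")
forced = f.should_schedule("https://www.zhihu.com/", dont_filter=True)
print(first, second, forced)
```

This is why the final spider in this post passes dont_filter=True when it requests a URL again after logging in.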

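Problem 4: After switching to FormRequest in post_login, printing response.body in after_login returned {r'0',msg:''}. Building a request for my personal home page at that point worked, and the response came back. But after setting start_urls to people/****, yield make_requests_from_url(start_urls) ran into a 302 redirect, and parse_page was still parsing the home page.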
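Instead of reading the printed body by eye, after_login could check the reply in code. The sketch below assumes the reply is JSON of the form {"r": 0, "msg": ""} and that r equal to 0 means success; both are assumptions based on the output quoted above, and login_succeeded is a hypothetical helper.

```python
import json


def login_succeeded(body):
    """Return True if the login reply looks like a success.

    Assumption: the reply is a JSON object with an integer field "r",
    where 0 means the login worked. An HTML login page is not valid
    JSON, so it is reported as a failure.
    """
    try:
        data = json.loads(body)
    except ValueError:
        return False
    return isinstance(data, dict) and data.get("r") == 0


print(login_succeeded('{"r": 0, "msg": ""}'))
print(login_succeeded("<html>login page</html>"))
```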

Problem 5: I tested with a URL that does not need a login, "https://www.zhihu.com/question/21872451":

In after_login:

return [Request("https://www.zhihu.com/question/21872451",meta={'cookiejar':response.meta['cookiejar']},headers = self.headers_zhihu,callback=self.parse_page)]

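The current page was parsed, but the rules did not take effect, and parse_page was called twice in the background for "https://www.zhihu.com/question/21872451". The logged-in state was available on that page, though.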

Problem 6: As the reverse test of Problem 5, I set start_urls="https://www.zhihu.com/question/21872451" and again called yield make_requests_from_url(start_urls). The logged-in state was lost again. I then switched back to:

return [Request("https://www.zhihu.com/question/21872451",meta={'cookiejar':response.meta['cookiejar']},headers = self.headers_zhihu,
    #callback=self.parse_page
)]

With the callback commented out, /question/21872451 itself was not parsed. The rules did take effect and parse_page was called, but without the logged-in state.

Summary:

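1 make_requests_from_url: if no callback is set, the default parse is called. The stock make_requests_from_url also does not carry the cookie, so it has to be overridden. If its callback is set to the same one the rule uses, the first page is parsed correctly but the rule does not take effect. (Some people say this is a conflict, so I changed the callback in make_requests_from_url to 'parse_item'; the rule still did not take effect and parse_page was not executed, so the problem is probably elsewhere.) Conversely, if no callback is set, the rule takes effect, but the first page is not parsed and there is no logged-in state.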

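2 CrawlSpider rules do not automatically carry the cookie when they build requests, and parse() must not be overridden. The official documentation says so: CrawlSpider uses parse() itself to implement its rule logic, so the spider fails if parse() is overridden.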
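One possible workaround, not tried in this post, is the process_request hook that Rule accepts. The sketch below assumes the Scrapy 2.0+ signature, where the hook receives the new request and the response it came from; carry_cookiejar is a hypothetical name, and the stand-in objects exist only so the snippet runs without Scrapy installed. Whether this would have fixed the Zhihu case is untested.

```python
from types import SimpleNamespace


def carry_cookiejar(request, response):
    """Copy the cookiejar key from the originating response to the new request."""
    jar = response.meta.get("cookiejar")
    if jar is not None:
        request.meta["cookiejar"] = jar
    return request


# Hypothetical wiring inside a CrawlSpider (not from the original post):
# rules = (
#     Rule(LinkExtractor(allow=r"/question/\d+"), callback="parse_page",
#          follow=True, process_request=carry_cookiejar),
# )

# Stand-ins for a Scrapy Response and Request, used only for this demo
response = SimpleNamespace(meta={"cookiejar": 1})
request = SimpleNamespace(meta={})
carry_cookiejar(request, response)
print(request.meta)
```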

Conclusion: I gave up on CrawlSpider and chose to override parse(), building my own requests inside parse.

The final login code:

zhihu.py

# -*- coding: utf-8 -*-
import scrapy
from bs4 import BeautifulSoup
from MyTest.items import *
from scrapy.http import Request, FormRequest
from scrapy.selector import Selector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.spider import BaseSpider
import urlparse
from scrapy import log


class ZhihuSpider(BaseSpider):
    name = "zhihu"
    #allowed_domains = ["zhihu.com"]
    start_urls = (
        'https://www.zhihu.com/',
    )
    headers_zhihu = {
        'Host': 'www.zhihu.com',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:39.0) Gecko/20100101 Firefox/39.0',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3',
        'Accept-Encoding': 'gzip,deflate,sdch',
        'Referer': 'https://www.zhihu.com',
        'If-None-Match': "FpeHbcRb4rpt_GuDL6-34nrLgGKd.gz",
        'Cache-Control': 'max-age=0',
        'Connection': 'keep-alive'
        # 'cookie': cookie
    }

    def start_requests(self):
        # Fetch the login page first so the _xsrf token can be read from it
        return [Request("https://www.zhihu.com/login/phone_num", meta={'cookiejar': 1}, headers=self.headers_zhihu, callback=self.post_login)]

    def post_login(self, response):
        print 'post_login'
        xsrf = Selector(response).xpath('//input[@name="_xsrf"]/@value').extract()[0]  # omitting [0] causes an error
        print 'xsrf' + xsrf
        return [FormRequest('https://www.zhihu.com/login/phone_num',
            method='POST',
            meta={
                'cookiejar': response.meta['cookiejar'],
                '_xsrf': xsrf
            },
            headers=self.headers_zhihu,
            formdata={
                'phone_num': '******',  # the quotes around the value must stay
                'password': '*****',
                '_xsrf': xsrf
            },
            callback=self.after_login,
            #dont_filter = True
        )]

    def after_login(self, response):
        print 'after_login'
        print response.body  # prints the returned msg
        for url in self.start_urls:
            print 'url...................' + url
            yield self.make_requests_from_url(url, response)

    def make_requests_from_url(self, url, response):
        return Request(url, dont_filter=True, meta={
                'cookiejar': response.meta['cookiejar'],
                'dont_redirect': True,
                'handle_httpstatus_list': [301, 302]
            },
            # callback=self.parse
        )

    def parse(self, response):
        items = []
        problem = Selector(response)
        item = ZhihuItem()
        name = problem.xpath('//span[@class="name"]/text()').extract()
        print name
        item['name'] = name
        urls = problem.xpath('//a[@class="question_link"]/@href').extract()
        print urls
        item['urls'] = urls
        print 'response ............url' + response.url
        item['url'] = response.url
        print item['url']
        items.append(item)
        yield item  # return the item
        for url in urls:
            print url
            yield scrapy.Request(urlparse.urljoin('https://www.zhihu.com', url), dont_filter=True,  # using url directly raises an error
                meta={
                    'cookiejar': response.meta['cookiejar'],  # set the cookiejar
                    'dont_redirect': True,  # prevent redirects
                    'handle_httpstatus_list': [301, 302]
                },
                callback=self.parse
            )
        #return item
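The spider joins each extracted href with the site root because the links on the page are relative, and a request built from a bare relative path fails. The spider code is Python 2, where the function lives in the urlparse module; the sketch below shows the same join in Python 3, with absolute as a hypothetical helper name.

```python
from urllib.parse import urljoin

BASE = "https://www.zhihu.com"


def absolute(href, base=BASE):
    """Turn a relative href into a full URL; full URLs pass through unchanged."""
    return urljoin(base, href)


print(absolute("/question/21872451"))
# prints https://www.zhihu.com/question/21872451
```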

settings.py

COOKIES_ENABLED = True
COOKIES_DEBUG = True

The rest of the processing is much the same as in the earlier qiubai crawl, so I won't explain it further.

Open questions: why does the rule stop working once make_requests_from_url sets a callback?

And why is start_urls not parsed even when it matches the rule?

0 0