Scraping Lagou (拉勾网) with Scrapy


2. Initializing the Lagou project and reading the CrawlSpider source code

scrapy genspider --list
lists the available spider templates:

Available templates:
  - basic
  - crawl
  - csvfeed
  - xmlfeed

scrapy genspider -t crawl lagou www.lagou.com

Note that the working directory differs between cmd and PyCharm: mark the project directory as the sources root in PyCharm, and configure the directories in settings.py.

The crawl template:

class LagouSpider(CrawlSpider):
    name = 'lagou'
    allowed_domains = ['www.lagou.com']
    start_urls = ['http://www.lagou.com/']

    rules = (
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        i = {}
        #i['domain_id'] = response.xpath('//input[@id="sid"]/@value').extract()
        #i['name'] = response.xpath('//div[@id="name"]').extract()
        #i['description'] = response.xpath('//div[@id="description"]').extract()
        return i

Reading and dissecting the source code
https://doc.scrapy.org/en/1.3/topics/spiders.html#crawlspider

CrawlSpider gives us a set of rules for simple link following, so links can be extracted and crawled iteratively.

rules:

The set of rules that CrawlSpider reads and executes.

parse_start_url(response):

A hook called with the responses of the start_urls; we can override it to add our own handling (more on this below).

example:

rules is an iterable containing Rule instances, each of which wraps a LinkExtractor, e.g.
allow=('category\.php', ), callback='parse_item'
allow is the URL pattern the extractor accepts; callback is the name of the method to call back to.
The callback is given as a string because rules is defined at class level, where self is not available, so a bound method cannot be referenced directly.

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class MySpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    rules = (
        # Extract links matching 'category.php' (but not matching 'subsection.php')
        # and follow links from them (since no callback means follow=True by default).
        Rule(LinkExtractor(allow=('category\.php', ), deny=('subsection\.php', ))),

        # Extract links matching 'item.php' and parse them with the spider's method parse_item
        Rule(LinkExtractor(allow=('item\.php', )), callback='parse_item'),
    )

    def parse_item(self, response):
        self.logger.info('Hi, this is an item page! %s', response.url)
        item = scrapy.Item()
        item['id'] = response.xpath('//td[@id="item_id"]/text()').re(r'ID: (\d+)')
        item['name'] = response.xpath('//td[@id="item_name"]/text()').extract()
        item['description'] = response.xpath('//td[@id="item_description"]/text()').extract()
        return item

Adapting the generated Lagou template
1. Change http to https in start_urls
2. Rename parse_item to our own parse_job
3. Click on CrawlSpider in class LagouSpider(CrawlSpider): to jump into the crawl source
4. class CrawlSpider(Spider): shows that it inherits from Spider
5. The entry point is def start_requests(self):
6. Alt + left/right arrow jumps between code locations
7. After step 5 the default callback is parse; CrawlSpider defines its own parse, so this time we must not override it as we did before (a sketch of the adjusted template follows this list)
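A sketch of what the template looks like after steps 1 and 2; the rule here is still the generated placeholder, the real Lagou rules are written further below:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class LagouSpider(CrawlSpider):
    name = 'lagou'
    allowed_domains = ['www.lagou.com']
    start_urls = ['https://www.lagou.com/']      # step 1: http -> https

    rules = (
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_job', follow=True),
    )

    def parse_job(self, response):               # step 2: parse_item -> parse_job
        # job-page parsing will go here
        return {}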

The core function of crawl.py is parse.

parse delegates to _parse_response:

    def parse(self, response):
        return self._parse_response(response, self.parse_start_url, cb_kwargs={}, follow=True)

_parse_response

  1. Checks whether a callback was passed in (for parse this is self.parse_start_url)
  2. We can override parse_start_url to add our own handling
  3. Passes the arguments to the callback and runs its output through process_results

The _parse_response function:

    def _parse_response(self, response, callback, cb_kwargs, follow=True):
        if callback:
            cb_res = callback(response, **cb_kwargs) or ()
            cb_res = self.process_results(response, cb_res)
            for requests_or_item in iterate_spider_output(cb_res):
                yield requests_or_item

        if follow and self._follow_links:
            for request_or_item in self._requests_to_follow(response):
                yield request_or_item

The return value of parse_start_url is received and processed by process_results.
If we do not override them, parse_start_url returns nothing, so effectively nothing happens (a sketch of overriding both hooks follows the snippet below).

    def process_results(self, response, results):
        return results
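If we do want to handle the responses of the start_urls ourselves, here is a minimal sketch of overriding both hooks in our spider; the method bodies are placeholders:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class LagouSpider(CrawlSpider):
    name = 'lagou'
    allowed_domains = ['www.lagou.com']
    start_urls = ['https://www.lagou.com/']
    rules = (
        Rule(LinkExtractor(allow=r'jobs/\d+.html'), callback='parse_job', follow=True),
    )

    def parse_start_url(self, response):
        # handle the start page itself; whatever is returned here is
        # fed into process_results below
        return []

    def process_results(self, response, results):
        # last chance to post-process a callback's results before they are yielded
        return results

    def parse_job(self, response):
        return {}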

Click through to _follow_links:

    def set_crawler(self, crawler):
        super(CrawlSpider, self).set_crawler(crawler)
        self._follow_links = crawler.settings.getbool('CRAWLSPIDER_FOLLOW_LINKS', True)

If the CRAWLSPIDER_FOLLOW_LINKS setting is enabled (it defaults to True), execution in parse goes on to follow links.
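For instance, link following could be switched off globally in settings.py, which mirrors the getbool call above:

# settings.py: disable link following for all CrawlSpiders (default is True)
CRAWLSPIDER_FOLLOW_LINKS = False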

_requests_to_follow
1. Checks that the argument is an HtmlResponse; if not, it returns straight away
2. Creates an empty set for the current response, used for de-duplication
3. Turns self._rules into an enumerated iterable
4. Jump to the Rule definition for the details
5. Extracts the concrete links for each rule with link_extractor.extract_links
6. Runs our process_links, if one is defined
7. For every finished link, issues a Request whose callback is _response_downloaded
8. Which in turn runs _parse_response

    def _requests_to_follow(self, response):
        if not isinstance(response, HtmlResponse):
            return
        seen = set()
        for n, rule in enumerate(self._rules):
            links = [lnk for lnk in rule.link_extractor.extract_links(response)
                     if lnk not in seen]
            if links and rule.process_links:
                links = rule.process_links(links)
            for link in links:
                seen.add(link)
                r = Request(url=link.url, callback=self._response_downloaded)
                r.meta.update(rule=n, link_text=link.text)
                yield rule.process_request(r)

_compile_rules

  1. _compile_rules is called when the spider is initialized
  2. [copy.copy(r) for r in self.rules] makes a copy of our rules
  3. get_method resolves the callback
  4. Resolves the process_links we defined in the rules
  5. Resolves the process_request we defined in the rules
    def _compile_rules(self):
        def get_method(method):
            if callable(method):
                return method
            elif isinstance(method, six.string_types):
                return getattr(self, method, None)

        self._rules = [copy.copy(r) for r in self.rules]
        for rule in self._rules:
            rule.callback = get_method(rule.callback)
            rule.process_links = get_method(rule.process_links)
            rule.process_request = get_method(rule.process_request)
These hooks are stored on each rule in Rule.__init__:

        self.process_links = process_links
        self.process_request = process_request

By passing our own processing functions into the rules we can customize the URLs, for example spreading requests over different IPs or locations for load balancing (a sketch follows).
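A hedged sketch of wiring a custom process_links into a Rule; rewrite_links and its filtering logic are hypothetical:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class LagouSpider(CrawlSpider):
    name = 'lagou'
    allowed_domains = ['www.lagou.com']
    start_urls = ['https://www.lagou.com/']

    rules = (
        # process_links can be given as a string; _compile_rules resolves it via get_method
        Rule(LinkExtractor(allow=r'jobs/\d+.html'), callback='parse_job',
             follow=True, process_links='rewrite_links'),
    )

    def rewrite_links(self, links):
        # hypothetical hook: filter or rewrite the extracted Link objects here,
        # e.g. map link.url onto several mirror hosts for load balancing
        return [link for link in links if link.url.startswith('https://')]

    def parse_job(self, response):
        return {}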

_response_downloaded
Looks up the concrete rule through response.meta['rule'],
then calls our own callback:

    def _response_downloaded(self, response):
        rule = self._rules[response.meta['rule']]
        return self._parse_response(response, rule.callback, rule.cb_kwargs, rule.follow)
  • allow: crawl URLs that match these patterns
  • deny: skip URLs that match these patterns
  • allow_domains: only handle URLs under these domains
  • deny_domains: do not handle URLs under these domains
  • restrict_xpaths: further restrict link extraction with XPath

LinkExtractor's full signature, followed by a short usage sketch:
    def __init__(self, allow=(), deny=(), allow_domains=(), deny_domains=(), restrict_xpaths=(),
                 tags=('a', 'area'), attrs=('href',), canonicalize=True,
                 unique=True, process_value=None, deny_extensions=None, restrict_css=()):
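A short sketch of these parameters in use; the patterns and the XPath are illustrative, not Lagou's real page structure:

from scrapy.linkextractors import LinkExtractor

link_extractor = LinkExtractor(
    allow=(r'jobs/\d+\.html',),                          # crawl URLs matching this pattern
    deny=(r'gongsi/.*interview.*',),                     # skip these
    allow_domains=('www.lagou.com',),                    # only this domain
    restrict_xpaths=('//div[@id="s_position_list"]',),   # only links inside this node
)

# links = link_extractor.extract_links(response)   # returns a list of Link objects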

extract_links
If restrict_xpaths is set, it is read and applied:

    def extract_links(self, response):
        base_url = get_base_url(response)
        if self.restrict_xpaths:
            docs = [subdoc
                    for x in self.restrict_xpaths
                    for subdoc in response.xpath(x)]
        else:
            docs = [response.selector]
        all_links = []
        for doc in docs:
            links = self._extract_links(doc, response.url, response.encoding, base_url)
            all_links.extend(self._process_links(links))
        return unique_list(all_links)

get_base_url:

urllib.parse.urljoin joins the base URL for us (a small urljoin illustration follows the snippet below).

def get_base_url(text, baseurl='', encoding='utf-8'):
    """Return the base url if declared in the given HTML `text`,
    relative to the given base url.

    If no base url is found, the given `baseurl` is returned.
    """
    text = to_unicode(text, encoding)
    m = _baseurl_re.search(text)
    if m:
        return moves.urllib.parse.urljoin(
            safe_url_string(baseurl),
            safe_url_string(m.group(1), encoding=encoding)
        )
    else:
        return safe_url_string(baseurl)
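A quick standard-library illustration of how urljoin resolves a relative link against a base URL:

from urllib.parse import urljoin

print(urljoin('https://www.lagou.com/zhaopin/', 'jobs/12345.html'))
# -> https://www.lagou.com/zhaopin/jobs/12345.html
print(urljoin('https://www.lagou.com/zhaopin/', '/jobs/12345.html'))
# -> https://www.lagou.com/jobs/12345.html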

Writing the Rule set for Lagou

    rules = (
        Rule(LinkExtractor(allow=("zhaopin/.*",)), follow=True),
        Rule(LinkExtractor(allow=("gongsi/j\d+.html",)), follow=True),
        Rule(LinkExtractor(allow=r'jobs/\d+.html'), callback='parse_job', follow=True),
    )
Scrapy's request/response flow:

  1. The spider we wrote yields a Request, which is handed to the engine
  2. The engine does nothing with it and passes it to the scheduler
  3. The scheduler hands the next request back to the engine
  4. The engine sends it through the downloader middlewares to the downloader
  5. The downloader sends the response back to the engine
  6. The engine passes the response to the spider
  7. The spider processes it, parsing out items and/or requests
  8. Items go to the item pipelines; requests go back to step 2 (a minimal sketch of steps 7 and 8 follows this list)
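A minimal sketch of steps 7 and 8: one callback yielding both items (routed to the item pipelines) and new requests (routed back to the scheduler); the CSS selectors are placeholders:

import scrapy


class JobSpider(scrapy.Spider):
    name = 'job_flow_example'
    start_urls = ['https://www.lagou.com/']

    def parse(self, response):
        # items go to the item pipelines (step 8)
        for title in response.css('a.position_link::text').extract():
            yield {'title': title}
        # requests go back to the engine and then the scheduler (step 2)
        for href in response.css('a.page_no::attr(href)').extract():
            yield scrapy.Request(response.urljoin(href), callback=self.parse)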

path:articlespider3\Lib\site-packages\scrapy\core

  • engine.py
  • scheduler.py
  • downloader
  • item
  • pipeline
  • spider

engine.py: the important function is schedule

  1. enqueue_request: puts the request into the scheduler
  2. _next_request_from_scheduler: takes the next request from the scheduler
    def schedule(self, request, spider):
        self.signals.send_catch_log(signal=signals.request_scheduled,
                request=request, spider=spider)
        if not self.slot.scheduler.enqueue_request(request):
            self.signals.send_catch_log(signal=signals.request_dropped,
                                        request=request, spider=spider)

articlespider3\Lib\site-packages\scrapy\core\downloader\handlers

The handlers support file, FTP and HTTP(S) downloads.

Middleware we will customize later (a skeleton follows):

  • spider middleware
  • downloader middleware
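A hedged skeleton of a downloader middleware; the class, the USER_AGENT_LIST setting and its behaviour are illustrative, and it would still have to be enabled in DOWNLOADER_MIDDLEWARES:

import random


class RandomUserAgentMiddleware(object):
    """Hypothetical downloader middleware that sets a User-Agent per request."""

    def __init__(self, user_agents):
        self.user_agents = user_agents

    @classmethod
    def from_crawler(cls, crawler):
        # read a hypothetical USER_AGENT_LIST setting, falling back to a single UA
        return cls(crawler.settings.getlist('USER_AGENT_LIST') or ['Mozilla/5.0'])

    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(self.user_agents)
        return None  # let the request continue through the remaining middlewares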

Django and Scrapy have a similar structure.

3. Two important Scrapy classes: Request and Response

Similar to Django's HttpRequest.

yield Request(url=parse.urljoin(response.url, post_url))

Request parameters:

class Request(object_ref):

    def __init__(self, url, callback=None, method='GET', headers=None, body=None,
                 cookies=None, meta=None, encoding='utf-8', priority=0,
                 dont_filter=False, errback=None):

cookies: 
Lib\site-packages\scrapy\downloadermiddlewares\cookies.py

cookiejarkey = request.meta.get("cookiejar")
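The 'cookiejar' meta key lets a single spider keep several cookie sessions apart; a short sketch based on the Scrapy docs, with an illustrative spider name and URLs:

import scrapy


class CookieSessionSpider(scrapy.Spider):
    name = 'cookie_sessions'

    def start_requests(self):
        urls = ['https://www.lagou.com/', 'https://www.lagou.com/zhaopin/']
        for i, url in enumerate(urls):
            # each distinct 'cookiejar' value gets its own cookie session
            yield scrapy.Request(url, meta={'cookiejar': i}, callback=self.parse_page)

    def parse_page(self, response):
        # the key has to be passed along explicitly to stay in the same session
        yield scrapy.Request('https://www.lagou.com/jobs/',
                             meta={'cookiejar': response.meta['cookiejar']},
                             callback=self.parse_job, dont_filter=True)

    def parse_job(self, response):
        self.logger.info('Got %s', response.url)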
  • priority: scheduling priority; requests with a higher value are scheduled earlier
  • dont_filter: do not let the duplicate filter drop this request
  • errback: the callback invoked when an error occurs

https://doc.scrapy.org/en/1.2/topics/request-response.html?highlight=response

errback example:

import scrapy

from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError
from twisted.internet.error import TimeoutError, TCPTimedOutError


class ErrbackSpider(scrapy.Spider):
    name = "errback_example"
    start_urls = [
        "http://www.httpbin.org/",              # HTTP 200 expected
        "http://www.httpbin.org/status/404",    # Not found error
        "http://www.httpbin.org/status/500",    # server issue
        "http://www.httpbin.org:12345/",        # non-responding host, timeout expected
        "http://www.httphttpbinbin.org/",       # DNS error expected
    ]

    def start_requests(self):
        for u in self.start_urls:
            yield scrapy.Request(u, callback=self.parse_httpbin,
                                    errback=self.errback_httpbin,
                                    dont_filter=True)

    def parse_httpbin(self, response):
        self.logger.info('Got successful response from {}'.format(response.url))
        # do something useful here...

    def errback_httpbin(self, failure):
        # log all failures
        self.logger.error(repr(failure))

        # in case you want to do something special for some errors,
        # you may need the failure's type:

        if failure.check(HttpError):
            # these exceptions come from HttpError spider middleware
            # you can get the non-200 response
            response = failure.value.response
            self.logger.error('HttpError on %s', response.url)

        elif failure.check(DNSLookupError):
            # this is the original request
            request = failure.request
            self.logger.error('DNSLookupError on %s', request.url)

        elif failure.check(TimeoutError, TCPTimedOutError):
            request = failure.request
            self.logger.error('TimeoutError on %s', request.url)

The Response class

    def __init__(self, url, status=200, headers=None, body=b'', flags=None, request=None):
        self.headers = Headers(headers or {})

Response parameters:
request: the Request we yielded that produced this response; it is attached to the response so we know where it came from (a short illustration follows).
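A small illustration of reading the originating request inside a callback; the spider name and URL are placeholders:

import scrapy


class ResponseDemoSpider(scrapy.Spider):
    name = 'response_demo'
    start_urls = ['https://www.lagou.com/']

    def parse(self, response):
        # response.request is the Request that produced this response;
        # response.meta is a shortcut for response.request.meta
        self.logger.info('crawled %s (from request %s)',
                         response.url, response.request.url)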