Scraping Zhilian Zhaopin Job Listings


Crawl plan: scrape 30 pages for each job category.

Determining the page count:

Locate the pagination bar at the bottom of the search-result page and read the page number from it (the "30" in the example below):

Previous  1 .... 28 29 30 31  Next
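
A minimal sketch, inside parse(), of how that page number could be read. The positional XPath is the one used in the full spider further down; it is tied to Zhilian's page layout at the time of writing, so treat it as an assumption:

# inside parse(): read the current page number from the pagination bar
pagenum = response.xpath(
    "//body/div[3]/div[3]/div[2]/form/div[1]/div[1]/div[3]/ul/li[6]/a/text()"
).extract_first()

# only keep crawling while we are within the planned 30 pages
if pagenum is not None and int(pagenum) <= 30:
    pass  # extract the job links and follow the next page here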


Locate these links to enter each job's detail page (they are the job-title links in the result list), e.g.:

PHP Engineer / PHP Intern (fresh graduates welcome) / PHP Software Development Engineer / PHP Engineer / PHP Engineer / PHP Engineer

# select the job-title link in each row of the result table
jobs = response.css("td.zwmc>div>a")
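
From that selector the link targets can be pulled and followed. A small sketch of the same step as it appears in the full spider below; response.urljoin is an extra safeguard in case the hrefs turn out to be relative:

# inside parse(): follow every job-title link to its detail page
jobsurl = response.css("td.zwmc>div>a::attr(href)").extract()
for joburl in jobsurl:
    # urljoin is harmless for absolute URLs and fixes relative ones
    yield scrapy.Request(response.urljoin(joburl), callback=self.parsejob)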


Parse the response returned for the URLs seeded at myspider:start_urls (the spider below uses the key zhilian:start_urls):

def parse(self, response):

    1. Determine the page number.

    2. Parse the page:

        (i)   extract the URLs of the jobs found on the page

        (ii)  yield a Request for each job URL, handing the response to the parsejob() callback for further processing

        (iii) extract the next-page URL and yield a Request for it


def parsejob(self, response):

    1. Extract the detailed information about the job.


Observations from the Scrapy log:

The important line: a request was issued and its response was downloaded (HTTP 200):

      [scrapy.core.engine]    DEBUG: Crawled (200) <GET http://..............>

And this line shows the response being handed to its callback and scraped into an item:

      [scrapy.core.scraper]   DEBUG: Scraped from <200 http://..............>


A situation encountered while running the spider:

        Test conditions:

                DOWNLOAD_DELAY = 5

                With no requests being generated, the local spider raised a NotImplementedError.

                Cause: the parse(self, response) method was commented out, leaving only
                parsejob(self, response) to handle responses. A response whose callback
                defaults to parse then falls through to the base scrapy.Spider.parse,
                which is not implemented and therefore raises NotImplementedError.

        Result:

                The spider did not stop running, because scrapy_redis has a spider-idle
                mechanism intended for distributed crawling: an idle spider is kept alive
                waiting for new URLs from the Redis queue. The related setting is
                SCHEDULER_IDLE_BEFORE_CLOSE, and the comment below (from the scrapy_redis
                example settings) explains it:

                                   
# Max idle time to prevent the spider from being closed when distributed crawling.
# This only works if queue class is SpiderQueue or SpiderStack,
# and may also block the same time when your spider start at the first time (because the queue is empty).
#SCHEDULER_IDLE_BEFORE_CLOSE = 10
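
For reference, a minimal settings.py sketch that wires up the scrapy_redis scheduler together with this idle timeout; the Redis address is an assumption for a local test setup:

# settings.py (sketch, assuming a local Redis instance)
SCHEDULER = "scrapy_redis.scheduler.Scheduler"               # Redis-backed scheduler
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"   # shared dupe filter
SCHEDULER_IDLE_BEFORE_CLOSE = 10     # close the spider after 10s of an empty queue
DOWNLOAD_DELAY = 5                   # the delay used in the test above
REDIS_URL = "redis://localhost:6379" # assumed local Redis address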

Spider code:

from scrapy_redis.spiders import RedisSpider
import scrapy


class MySpider(RedisSpider):
    name = "zhilian"
    redis_key = "zhilian:start_urls"
    allowed_domains = ["jobs.zhaopin.com", "sou.zhaopin.com"]

    def parse(self, response):
        # read the current page number from the pagination bar
        pagenum = response.xpath("//body/div[3]/div[3]/div[2]/form/div[1]/div[1]/div[3]/ul/li[6]/a/text()").extract_first()
        if int(pagenum) <= 30:
            # follow every job-title link on this result page
            jobsurl = response.css("td.zwmc>div>a::attr(href)").extract()
            for joburl in jobsurl:
                yield scrapy.Request(joburl, callback=self.parsejob)
            # follow the "next page" link and parse it the same way
            nextPage = response.xpath("//body/div[3]/div[3]/div[2]/form/div[1]/div[1]/div[3]/ul/li[11]/a/@href").extract_first()
            yield scrapy.Request(nextPage, callback=self.parse)

    def parsejob(self, response):
        # extract the job title from the detail page
        yield {
            'jobname': response.xpath("//body/div[5]/div[1]/div[1]/h1/text()").extract_first(),
        }
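
To start a crawl, the spider is launched as usual and the start URL is pushed onto the Redis list named by redis_key; a usage sketch, assuming Redis runs locally (fill in the actual search-result URL):

scrapy crawl zhilian
redis-cli lpush zhilian:start_urls "http://sou.zhaopin.com/..."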

