First Encounter with scrapy-redis
Source: Internet · Editor: 程序博客网 · Date: 2024/06/18 13:32
This is my first time using scrapy-redis. I had heard about it for a long time but never dared to try it; today I built my first small crawler with it.
spider
```python
# -*- coding: utf-8 -*-
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from elemme.items import ElemmeItem
from .redis_pie import BloomFilter


class Eleme(CrawlSpider):
    name = 'spider'
    allowed_domains = ['douban.com']
    start_urls = ['https://book.douban.com/']
    bf = BloomFilter()
    rules = (
        Rule(LinkExtractor(allow=(r'https://book.douban.com/tag',)),
             callback='parse_shop', follow=True),
    )

    def parse_shop(self, response):
        # Skip pages we have already seen (Bloom-filter dedup).
        if self.bf.isContains(response.url):
            return
        self.bf.insert(response.url)
        for book in response.xpath('//*[@class="subject-list"]/li'):
            try:
                info = book.xpath('div[@class="info"]')
                book_name = info.xpath('h2/a/text()').extract()[0].strip()
                pub = info.xpath('div[@class="pub"]/text()').extract()[0].strip().split()
                rating_nums = info.xpath('div[@class="star clearfix"]/span[@class="rating_nums"]/text()').extract()[0].strip()
                book_pj = info.xpath('div[@class="star clearfix"]/span[@class="pl"]/text()').extract()[0].strip()
            except IndexError:
                continue  # entry is missing a field; skip it
            item = ElemmeItem()
            item['book_name'] = book_name
            item['book_author'] = pub[0]
            item['book_price'] = pub[-1]
            item['rating_nums'] = rating_nums
            item['book_pj'] = book_pj
            yield item
```
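The `pub` line on a Douban list page packs author, publisher, date, and price into one slash-separated string, which is why the spider keeps the first and last whitespace-separated tokens. A quick sketch with a made-up sample line (hypothetical data, not fetched from the site):

```python
# A typical Douban "pub" line (sample text for illustration only).
pub_line = "张三 / 人民文学出版社 / 2019-6 / 45.00元"

# The spider splits on whitespace and keeps the two ends:
tokens = pub_line.strip().split()
book_author = tokens[0]   # first token: the author
book_price = tokens[-1]   # last token: the price

print(book_author, book_price)
```

Note this heuristic breaks when the author field itself contains spaces (e.g. a translated name with a country prefix), in which case the first token is only part of the author.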
Bloom filter
```python
# encoding=utf-8
import redis
from hashlib import md5
from elemme.settings import *


class SimpleHash(object):
    def __init__(self, cap, seed):
        self.cap = cap
        self.seed = seed

    def hash(self, value):
        ret = 0
        for i in range(len(value)):
            ret += self.seed * ret + ord(value[i])
        return (self.cap - 1) & ret


class BloomFilter(object):
    def __init__(self, host=redis_host, password=redis_pwd, port=6379,
                 db=bl_db, blockNum=1, key='bloomfilter'):
        """
        :param host: the host of Redis
        :param port: the port of Redis
        :param db: which db in Redis
        :param blockNum: one blockNum holds roughly 90,000,000 strings;
                         increase it if you have more strings to filter.
        :param key: the key's name in Redis
        """
        self.server = redis.Redis(host=host, password=password, port=port, db=db)
        self.bit_size = 1 << 31  # Redis strings max out at 512 MB; this uses 256 MB
        self.seeds = [5, 7, 11, 13, 31, 37, 61]
        self.key = key
        self.blockNum = blockNum
        self.hashfunc = []
        for seed in self.seeds:
            self.hashfunc.append(SimpleHash(self.bit_size, seed))

    def isContains(self, str_input):
        if not str_input:
            return False
        if isinstance(str_input, str):
            str_input = str_input.encode('utf-8')
        m5 = md5()
        m5.update(str_input)
        str_input = m5.hexdigest()
        ret = True
        name = self.key + str(int(str_input[0:2], 16) % self.blockNum)
        for f in self.hashfunc:
            loc = f.hash(str_input)
            ret = ret & self.server.getbit(name, loc)
        return ret

    def insert(self, str_input):
        if isinstance(str_input, str):
            str_input = str_input.encode('utf-8')
        m5 = md5()
        m5.update(str_input)
        str_input = m5.hexdigest()
        name = self.key + str(int(str_input[0:2], 16) % self.blockNum)
        for f in self.hashfunc:
            loc = f.hash(str_input)
            self.server.setbit(name, loc, 1)
```
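The docstring's claim that one block handles about 90,000,000 strings can be sanity-checked with the standard Bloom-filter false-positive estimate p ≈ (1 − e^(−kn/m))^k, where m is the number of bits (1 << 31 here), k the number of hash seeds (7), and n the number of inserted items:

```python
import math

m = 1 << 31        # bits in one block (a 256 MB Redis string)
k = 7              # number of hash seeds used above
n = 90_000_000     # strings per block, per the docstring

# Standard Bloom-filter false-positive estimate: (1 - e^(-kn/m))^k
p = (1 - math.exp(-k * n / m)) ** k
print(f"estimated false-positive rate: {p:.2e}")
```

The result is on the order of 7e-05, i.e. fewer than one false "duplicate" in ten thousand URLs at full load, which is plenty for crawl dedup. (This assumes ideally independent hash functions; the simple seeded hashes above are somewhat weaker in practice.)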
I didn't make it distributed; I only used the Bloom filter for deduplication.
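To try the dedup logic without a Redis server, the same hashing scheme can run against a plain Python `set` standing in for the Redis bitmap. This is a hypothetical in-memory sketch, not part of the spider:

```python
from hashlib import md5


class SimpleHash:
    def __init__(self, cap, seed):
        self.cap = cap
        self.seed = seed

    def hash(self, value):
        ret = 0
        for ch in value:
            ret += self.seed * ret + ord(ch)
        return (self.cap - 1) & ret


class InMemoryBloomFilter:
    """Same seeds and md5 step as above, but bits live in a Python set."""

    def __init__(self):
        self.bit_size = 1 << 31
        self.hashfunc = [SimpleHash(self.bit_size, s)
                         for s in (5, 7, 11, 13, 31, 37, 61)]
        self.bits = set()  # stands in for the Redis bitmap

    def _digest(self, s):
        return md5(s.encode('utf-8')).hexdigest()

    def isContains(self, s):
        if not s:
            return False
        h = self._digest(s)
        # A URL counts as seen only if every hash position is set.
        return all(f.hash(h) in self.bits for f in self.hashfunc)

    def insert(self, s):
        h = self._digest(s)
        for f in self.hashfunc:
            self.bits.add(f.hash(h))


bf = InMemoryBloomFilter()
url = "https://book.douban.com/tag/小说"
print(bf.isContains(url))  # False: not seen yet
bf.insert(url)
print(bf.isContains(url))  # True: flagged as a duplicate
```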
And here is the data I crawled.