[Source Code Study][Zhiliao Development] WebMagic's Four Core Components: Scheduler
Preface
First, see how the documentation describes the role of the Scheduler:
https://code4craft.gitbooks.io/webmagic-in-action/content/zh/posts/ch1-overview/architecture.html
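As covered earlier, the Scheduler plans what the crawler fetches next, which includes responsibilities such as URL deduplication. We already met the Scheduler in the main flow; now let's analyze it against the source code.
Source code
Scheduler is an interface: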
public interface Scheduler {

    /**
     * add a url to fetch
     *
     * @param request
     * @param task
     */
    public void push(Request request, Task task);

    /**
     * get an url to crawl
     *
     * @param task the task of spider
     * @return the url to crawl
     */
    public Request poll(Task task);
}
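Its main implementation is the abstract class DuplicateRemovedScheduler, which uses the template method pattern to fix the steps of push: check for a duplicate first, then hand the request to pushWhenNoDuplicate, which subclasses override.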
public abstract class DuplicateRemovedScheduler implements Scheduler {

    protected Logger logger = LoggerFactory.getLogger(getClass());

    private DuplicateRemover duplicatedRemover = new HashSetDuplicateRemover();

    public DuplicateRemover getDuplicateRemover() {
        return duplicatedRemover;
    }

    public DuplicateRemovedScheduler setDuplicateRemover(DuplicateRemover duplicatedRemover) {
        this.duplicatedRemover = duplicatedRemover;
        return this;
    }

    @Override
    public void push(Request request, Task task) {
        logger.trace("get a candidate url {}", request.getUrl());
        if (!duplicatedRemover.isDuplicate(request, task) || shouldReserved(request)) {
            logger.debug("push to queue {}", request.getUrl());
            pushWhenNoDuplicate(request, task);
        }
    }

    protected boolean shouldReserved(Request request) {
        return request.getExtra(Request.CYCLE_TRIED_TIMES) != null;
    }

    protected void pushWhenNoDuplicate(Request request, Task task) {

    }
}
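To make the template method idea concrete, here is a minimal, self-contained sketch. It does not use WebMagic: DedupSchedulerSketch and StackSchedulerSketch are hypothetical names, and plain String URLs stand in for Request. It only illustrates how a fixed push step with an overridable hook lets a subclass change the storage strategy (here, last-in-first-out) without touching the dedup logic.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

// Hypothetical, simplified stand-in for DuplicateRemovedScheduler:
// push() is the fixed template, pushWhenNoDuplicate() is the hook.
abstract class DedupSchedulerSketch {
    private final Set<String> seen = new HashSet<>();

    // Template method: the dedup check always runs before the hook.
    public final synchronized void push(String url) {
        if (seen.add(url)) { // add() returns false for a URL seen before
            pushWhenNoDuplicate(url);
        }
    }

    // Hook: subclasses decide how accepted URLs are stored.
    protected abstract void pushWhenNoDuplicate(String url);

    public abstract String poll();
}

// A depth-first variant: the most recently pushed URL is polled first.
class StackSchedulerSketch extends DedupSchedulerSketch {
    private final Deque<String> stack = new ArrayDeque<>();

    @Override
    protected void pushWhenNoDuplicate(String url) {
        stack.push(url);
    }

    @Override
    public synchronized String poll() {
        return stack.poll(); // null when the stack is empty
    }
}

class TemplateMethodDemo {
    public static void main(String[] args) {
        StackSchedulerSketch scheduler = new StackSchedulerSketch();
        scheduler.push("http://example.com/a");
        scheduler.push("http://example.com/b");
        scheduler.push("http://example.com/a"); // duplicate, dropped by push()
        System.out.println(scheduler.poll());
        System.out.println(scheduler.poll());
        System.out.println(scheduler.poll());
    }
}
```

In this sketch the subclass never sees a duplicate, which mirrors why the concrete WebMagic schedulers only need to implement storage and polling.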
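Now look at the DuplicateRemover interface, which is responsible for deduplication. Its implementations are HashSetDuplicateRemover, which deduplicates with a HashSet; RedisScheduler, which relies on Redis; and BloomFilterDuplicateRemover, which uses a Bloom filter. HashSetDuplicateRemover is the default.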
public class HashSetDuplicateRemover implements DuplicateRemover {

    private Set<String> urls = Sets.newSetFromMap(new ConcurrentHashMap<String, Boolean>());

    @Override
    public boolean isDuplicate(Request request, Task task) {
        return !urls.add(getUrl(request));
    }

    protected String getUrl(Request request) {
        return request.getUrl();
    }

    @Override
    public void resetDuplicateCheck(Task task) {
        urls.clear();
    }

    @Override
    public int getTotalRequestsCount(Task task) {
        return urls.size();
    }
}
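The whole dedup check rests on Set.add returning false when the element is already present. The following standalone sketch shows that behavior with plain JDK classes; UrlDedupSketch is a hypothetical name, and ConcurrentHashMap.newKeySet() (JDK 8+) is swapped in as a comparable replacement for the Sets.newSetFromMap call in the quoted source.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical minimal remover: records URLs in a concurrent set.
class UrlDedupSketch {
    // A concurrent Set view backed by a ConcurrentHashMap.
    private final Set<String> urls = ConcurrentHashMap.newKeySet();

    // Returns true if the URL was seen before; otherwise records it.
    boolean isDuplicate(String url) {
        return !urls.add(url);
    }

    int getTotalRequestsCount() {
        return urls.size();
    }

    void resetDuplicateCheck() {
        urls.clear();
    }

    public static void main(String[] args) {
        UrlDedupSketch remover = new UrlDedupSketch();
        System.out.println(remover.isDuplicate("http://example.com/a")); // first sighting
        System.out.println(remover.isDuplicate("http://example.com/a")); // second sighting
        System.out.println(remover.getTotalRequestsCount());
    }
}
```

Because the check and the insertion happen in a single add call, there is no window between "is it there?" and "record it" in which another thread could slip in the same URL.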
The abstract class DuplicateRemovedScheduler has four concrete implementations: QueueScheduler, PriorityScheduler, FileCacheQueueScheduler, and RedisScheduler. QueueScheduler is the default.
@ThreadSafe
public class QueueScheduler extends DuplicateRemovedScheduler implements MonitorableScheduler {

    private BlockingQueue<Request> queue = new LinkedBlockingQueue<Request>();

    @Override
    public void pushWhenNoDuplicate(Request request, Task task) {
        queue.add(request);
    }

    @Override
    public synchronized Request poll(Task task) {
        return queue.poll();
    }

    @Override
    public int getLeftRequestsCount(Task task) {
        return queue.size();
    }

    @Override
    public int getTotalRequestsCount(Task task) {
        return getDuplicateRemover().getTotalRequestsCount(task);
    }
}
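Internally it stores Requests in a LinkedBlockingQueue, which is an unbounded queue. You will have noticed the @ThreadSafe annotation, so here is a question: does the Scheduler have thread-synchronization problems, and if so, how are they solved?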
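One plausible reading of the code quoted above, offered as a sketch rather than a definitive answer: push is not synchronized, so it relies on two thread-safe building blocks, the ConcurrentHashMap-backed set (atomic add) and the LinkedBlockingQueue. The standalone example below models that combination and pushes the same URLs from several threads; ConcurrentPushSketch and runDemo are hypothetical names, and String URLs stand in for Request.

```java
import java.util.Set;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical model of QueueScheduler's push path without any locking.
class ConcurrentPushSketch {
    private final Set<String> seen = ConcurrentHashMap.newKeySet();
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();

    // No synchronized keyword: add() on the concurrent set is atomic,
    // so only one thread can win for a given URL.
    void push(String url) {
        if (seen.add(url)) {
            queue.add(url);
        }
    }

    int size() {
        return queue.size();
    }

    // Every thread pushes the same URLs; returns how many were queued.
    static int runDemo(int threads, int urlsPerThread) {
        ConcurrentPushSketch scheduler = new ConcurrentPushSketch();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.execute(() -> {
                for (int i = 0; i < urlsPerThread; i++) {
                    scheduler.push("http://example.com/" + i);
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return scheduler.size();
    }

    public static void main(String[] args) {
        // 8 threads each push the same 1000 URLs.
        System.out.println(runDemo(8, 1000));
    }
}
```

Under these assumptions each distinct URL is queued exactly once no matter how the threads interleave. The synchronized keyword on poll in the quoted source is an additional guard on the read side.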
On to the next one:
@ThreadSafe
public class PriorityScheduler extends DuplicateRemovedScheduler implements MonitorableScheduler {

    public static final int INITIAL_CAPACITY = 5;

    private BlockingQueue<Request> noPriorityQueue = new LinkedBlockingQueue<Request>();

    private PriorityBlockingQueue<Request> priorityQueuePlus = new PriorityBlockingQueue<Request>(INITIAL_CAPACITY, new Comparator<Request>() {
        @Override
        public int compare(Request o1, Request o2) {
            return -NumberUtils.compareLong(o1.getPriority(), o2.getPriority());
        }
    });

    private PriorityBlockingQueue<Request> priorityQueueMinus = new PriorityBlockingQueue<Request>(INITIAL_CAPACITY, new Comparator<Request>() {
        @Override
        public int compare(Request o1, Request o2) {
            return -NumberUtils.compareLong(o1.getPriority(), o2.getPriority());
        }
    });

    @Override
    public void pushWhenNoDuplicate(Request request, Task task) {
        if (request.getPriority() == 0) {
            noPriorityQueue.add(request);
        } else if (request.getPriority() > 0) {
            priorityQueuePlus.put(request);
        } else {
            priorityQueueMinus.put(request);
        }
    }

    @Override
    public synchronized Request poll(Task task) {
        Request poll = priorityQueuePlus.poll();
        if (poll != null) {
            return poll;
        }
        poll = noPriorityQueue.poll();
        if (poll != null) {
            return poll;
        }
        return priorityQueueMinus.poll();
    }

    @Override
    public int getLeftRequestsCount(Task task) {
        return noPriorityQueue.size();
    }

    @Override
    public int getTotalRequestsCount(Task task) {
        return getDuplicateRemover().getTotalRequestsCount(task);
    }
}
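Here we see two PriorityBlockingQueues and one LinkedBlockingQueue. poll drains them in a fixed order: requests with positive priority first (highest value first), then requests with priority 0 in insertion order, and finally requests with negative priority.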
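The polling order is easier to see with concrete values. This is a standalone sketch of the three-queue layout, not the WebMagic class itself: PriorityPollSketch and its nested Req type are hypothetical, and Long.compare with swapped arguments stands in for the negated NumberUtils.compareLong in the quoted source.

```java
import java.util.Comparator;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.PriorityBlockingQueue;

class PriorityPollSketch {
    // Hypothetical minimal request: a URL plus a priority.
    static final class Req {
        final String url;
        final long priority;

        Req(String url, long priority) {
            this.url = url;
            this.priority = priority;
        }
    }

    // Larger priority values come out of the queue first.
    private static final Comparator<Req> HIGHEST_FIRST =
            (a, b) -> Long.compare(b.priority, a.priority);

    private final BlockingQueue<Req> noPriority = new LinkedBlockingQueue<>();
    private final PriorityBlockingQueue<Req> plus = new PriorityBlockingQueue<>(5, HIGHEST_FIRST);
    private final PriorityBlockingQueue<Req> minus = new PriorityBlockingQueue<>(5, HIGHEST_FIRST);

    void push(String url, long priority) {
        Req req = new Req(url, priority);
        if (priority == 0) {
            noPriority.add(req);
        } else if (priority > 0) {
            plus.put(req);
        } else {
            minus.put(req);
        }
    }

    // Positive priorities, then priority 0, then negative priorities.
    synchronized String poll() {
        Req req = plus.poll();
        if (req == null) {
            req = noPriority.poll();
        }
        if (req == null) {
            req = minus.poll();
        }
        return req == null ? null : req.url;
    }

    public static void main(String[] args) {
        PriorityPollSketch scheduler = new PriorityPollSketch();
        scheduler.push("a", 0);
        scheduler.push("b", -1);
        scheduler.push("c", 10);
        scheduler.push("d", 1);
        scheduler.push("e", -5);
        String url;
        while ((url = scheduler.poll()) != null) {
            System.out.println(url); // prints c, d, a, b, e (one per line)
        }
    }
}
```

A request with priority 0 can therefore never overtake a positive-priority request, regardless of when it was pushed.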
Continuing:
public class FileCacheQueueScheduler extends DuplicateRemovedScheduler implements MonitorableScheduler {

    private String filePath = System.getProperty("java.io.tmpdir");

    private String fileUrlAllName = ".urls.txt";

    private Task task;

    private String fileCursor = ".cursor.txt";

    private PrintWriter fileUrlWriter;

    private PrintWriter fileCursorWriter;

    private AtomicInteger cursor = new AtomicInteger();

    private AtomicBoolean inited = new AtomicBoolean(false);

    private BlockingQueue<Request> queue;

    private Set<String> urls;

    public FileCacheQueueScheduler(String filePath) {
        if (!filePath.endsWith("/") && !filePath.endsWith("\\")) {
            filePath += "/";
        }
        this.filePath = filePath;
    }

    private void flush() {
        fileUrlWriter.flush();
        fileCursorWriter.flush();
    }

    private void init(Task task) {
        this.task = task;
        File file = new File(filePath);
        if (!file.exists()) {
            file.mkdirs();
        }
        readFile();
        initWriter();
        initFlushThread();
        inited.set(true);
        logger.info("init cache scheduler success");
    }

    private void initFlushThread() {
        Executors.newScheduledThreadPool(1).scheduleAtFixedRate(new Runnable() {
            @Override
            public void run() {
                flush();
            }
        }, 10, 10, TimeUnit.SECONDS);
    }

    private void initWriter() {
        try {
            fileUrlWriter = new PrintWriter(new FileWriter(getFileName(fileUrlAllName), true));
            fileCursorWriter = new PrintWriter(new FileWriter(getFileName(fileCursor), false));
        } catch (IOException e) {
            throw new RuntimeException("init cache scheduler error", e);
        }
    }

    private void readFile() {
        try {
            queue = new LinkedBlockingQueue<Request>();
            urls = new LinkedHashSet<String>();
            readCursorFile();
            readUrlFile();
        } catch (FileNotFoundException e) {
            //init
            logger.info("init cache file " + getFileName(fileUrlAllName));
        } catch (IOException e) {
            logger.error("init file error", e);
        }
    }

    private void readUrlFile() throws IOException {
        String line;
        BufferedReader fileUrlReader = null;
        try {
            fileUrlReader = new BufferedReader(new FileReader(getFileName(fileUrlAllName)));
            int lineReaded = 0;
            while ((line = fileUrlReader.readLine()) != null) {
                urls.add(line.trim());
                lineReaded++;
                if (lineReaded > cursor.get()) {
                    queue.add(new Request(line));
                }
            }
        } finally {
            if (fileUrlReader != null) {
                IOUtils.closeQuietly(fileUrlReader);
            }
        }
    }

    private void readCursorFile() throws IOException {
        BufferedReader fileCursorReader = null;
        try {
            fileCursorReader = new BufferedReader(new FileReader(getFileName(fileCursor)));
            String line;
            //read the last number
            while ((line = fileCursorReader.readLine()) != null) {
                cursor = new AtomicInteger(NumberUtils.toInt(line));
            }
        } finally {
            if (fileCursorReader != null) {
                IOUtils.closeQuietly(fileCursorReader);
            }
        }
    }

    private String getFileName(String filename) {
        return filePath + task.getUUID() + filename;
    }

    @Override
    protected void pushWhenNoDuplicate(Request request, Task task) {
        if (!inited.get()) {
            init(task);
        }
        queue.add(request);
        fileUrlWriter.println(request.getUrl());
    }

    @Override
    public synchronized Request poll(Task task) {
        if (!inited.get()) {
            init(task);
        }
        fileCursorWriter.println(cursor.incrementAndGet());
        return queue.poll();
    }

    @Override
    public int getLeftRequestsCount(Task task) {
        return queue.size();
    }

    @Override
    public int getTotalRequestsCount(Task task) {
        return getDuplicateRemover().getTotalRequestsCount(task);
    }
}
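It stores the URLs and a cursor (a count of how many URLs have already been polled) in two separate files, and creates a scheduled executor that flushes both writers every 10 seconds. The URLs held in memory still live in a BlockingQueue.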
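The resume-after-restart idea behind the two files can be sketched in isolation. The example below is a simplified model under stated assumptions, not the WebMagic implementation: FileCursorSketch and demo are hypothetical names, every write goes to disk immediately instead of being flushed on a timer, and the cursor only advances when a URL is actually returned.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayDeque;
import java.util.List;
import java.util.Queue;

class FileCursorSketch {
    private final Path urlFile;
    private final Path cursorFile;
    private final Queue<String> queue = new ArrayDeque<>();
    private int cursor; // number of URLs already handed out

    FileCursorSketch(Path dir, String taskName) {
        urlFile = dir.resolve(taskName + ".urls.txt");
        cursorFile = dir.resolve(taskName + ".cursor.txt");
        try {
            if (Files.exists(cursorFile)) {
                // The last line holds the most recent cursor value.
                List<String> lines = Files.readAllLines(cursorFile, StandardCharsets.UTF_8);
                if (!lines.isEmpty()) {
                    cursor = Integer.parseInt(lines.get(lines.size() - 1).trim());
                }
            }
            if (Files.exists(urlFile)) {
                // Re-queue only the URLs that come after the cursor.
                List<String> all = Files.readAllLines(urlFile, StandardCharsets.UTF_8);
                for (int i = cursor; i < all.size(); i++) {
                    queue.add(all.get(i));
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    synchronized void push(String url) {
        queue.add(url);
        append(urlFile, url);
    }

    synchronized String poll() {
        String next = queue.poll();
        if (next != null) {
            cursor++;
            append(cursorFile, Integer.toString(cursor));
        }
        return next;
    }

    private static void append(Path file, String line) {
        try {
            Files.write(file, (line + "\n").getBytes(StandardCharsets.UTF_8),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Pushes three URLs, polls one, then simulates a restart.
    static String demo() {
        try {
            Path dir = Files.createTempDirectory("scheduler-sketch");
            FileCursorSketch first = new FileCursorSketch(dir, "task");
            first.push("http://example.com/1");
            first.push("http://example.com/2");
            first.push("http://example.com/3");
            first.poll(); // consumes .../1, cursor becomes 1
            FileCursorSketch second = new FileCursorSketch(dir, "task");
            return second.poll(); // the restarted instance resumes at .../2
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The trade-off in the quoted source is different from this sketch: flushing on a 10-second timer is cheaper per request, but a crash can lose whatever was written since the last flush.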
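I don't really understand RedisScheduler yet; I haven't worked with Redis so far :)
Usage
In practice, you need to choose a Scheduler and a DuplicateRemover based on the characteristics of your own crawler, and only by understanding how they work can you pick the most suitable components.
It's really great that every WebMagic component can be configured and swapped out yourself~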