多账号定向网站爬取关键技术

来源:互联网 发布:mac air 装单win详细 编辑:程序博客网 时间:2024/06/05 00:54

场景描述:

多个帐号模拟登录同一个网站,将不同帐号的cookie保存至本地,对于网站中的异步接口数据,各个帐号(通过携带cookie)都可同时并行请求。


问题描述:

假设有不同帐号,100个线程的爬取请求的cookie失效时,某个线程将会占有唯一浏览器,重新登录网站。其它线程如何处理?


1 爬取线程不应该直接刷新cookie,而是提交刷新cookie的需求

2 由cookie管理器控制哪个请求线程具体占有浏览器实现登录,获取cookie


cookie刷新观察者,监听cookie刷新的需求,并实际执行cookie刷新

public class CookiesAskWatcher implements Observer{

    private CookiesFlushSemaphore cookiesFlushSemaphore;

    CookiesAskWatcher(CookiesFlushSemaphore cookiesFlushSemaphore){
        this.cookiesFlushSemaphore = cookiesFlushSemaphore;
    }

    @Override
    public void update(Observable o, Object arg) {        
       CookiesStoreUtils.flush(帐号,密码);
       cookiesFlushSemaphore.confirmFlush();       
    }
}


cookie刷新的请求的被观察者,根据情况提交刷行请求,或者阻塞,唤醒后直接使用可用的cookie

public class CookiesAskWatched extends Observable{

    private boolean volatile isFlushing = false;

    private CookiesFlushSemaphore cookiesFlushSemaphore;

    public CookiesAskWatched(CookiesFlushSemaphore cookiesFlushSemaphore){
        this.cookiesFlushSemaphore = cookiesFlushSemaphore;
    }

    public void askFlush(Param param) throws Exception{
        if(isFlushing == false){
            synchronized (this) {
                if (isFlushing == false) {
                    isFlushing = true;
                    setChanged();
                    notifyObservers(param);
                    isFlushing = false;
                    return;
                }
            }
        }
        cookiesFlushSemaphore.waitFlush();
    }
}


cookie管理者,初始化需要定义cookie观察者和主题

public class CookiesManager {

    private static Map<Integer, CookiesAskWatched> watcheds = new HashMap<>();

    public CookiesManager(){
        int i = 1;
        while (i <= 4){
            CookiesFlushSemaphore cookiesFlushSemaphore = new CookiesFlushSemaphore();
            CookiesAskWatched cookiesAskWatched = new CookiesAskWatched(cookiesFlushSemaphore);
            cookiesAskWatched.addObserver(new CookiesAskWatcher(cookiesFlushSemaphore));
            watcheds.put(i, cookiesAskWatched);
            i++;
        }
    }

    public void submitFlush(Param param) throws Exception{
        Integer masterCard = param.getMasterCard();
        watcheds.get(masterCard).askFlush(param);
    }

信号量,换起阻塞的线程

public class CookiesFlushSemaphore {

    private boolean signal = false;
   

    public synchronized void waitFlush() throws InterruptedException{

        this.signal = false;

        while (!this.signal)wait();
    }

    public synchronized void confirmFlush(){
        this.signal = true;
        this.notifyAll();
    }

}


原创粉丝点击