Web Crawler


I. Preface


Recently I needed to test and modify some anti-crawler code. I had never worked with anti-crawling before; I'd heard of it but never seen it up close.


To fight crawlers you first have to understand how they think, and counter move for move. The unfortunate truth is that a determined crawler can fetch almost anything: throttle the request rate, spread small batches of requests across a large pool of proxies, and most defenses fall. Anti-crawling measures can only raise the bar, at least keeping out small practice crawlers (like the beginner-level one in this post) and shaving off some load.



II. Design Approach


(1) a queue that collects links across the whole target site, or within a given subdomain

(2) a queue of URLs waiting to be fetched (this overlaps somewhat with the one above; it trades space for time to speed up crawling)

(3) a data structure that records URLs already visited


With the data structures settled, the next question is the algorithm. Breadth-first crawling is generally recommended, so the crawler doesn't fall into anti-crawler traps built from loops of infinitely deep links.
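
As a rough illustration of that breadth-first idea (this is not the code used below; the fetchLinks helper and the MAX_DEPTH value are hypothetical):

    import java.util.HashSet;
    import java.util.LinkedList;
    import java.util.Queue;
    import java.util.Set;

    // a minimal BFS sketch with an explicit depth cap, so infinitely deep
    // link traps are cut off; fetchLinks() is a hypothetical helper that
    // returns the <a href> targets of a page
    static final int MAX_DEPTH = 5;

    static void bfsCrawl(String seed) {
        Set<String> visited = new HashSet<String>();
        Queue<String> queue = new LinkedList<String>();
        queue.offer(seed);
        visited.add(seed);
        for (int depth = 0; depth < MAX_DEPTH && !queue.isEmpty(); depth++) {
            int levelSize = queue.size();        // process the frontier one level at a time
            for (int i = 0; i < levelSize; i++) {
                String url = queue.poll();
                for (String link : fetchLinks(url)) {
                    if (visited.add(link)) {     // add() returns false if already seen
                        queue.offer(link);
                    }
                }
            }
        }
    }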


The implementation uses jsoup (a library for parsing HTML) and httpclient (an HTTP request library) to keep the code short.
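
For anyone new to the two libraries, here is a minimal, self-contained sketch of the division of labor (the URL is a placeholder; HttpClient 4.x and any recent jsoup release are assumed):

    import org.apache.http.client.methods.CloseableHttpResponse;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;
    import org.apache.http.util.EntityUtils;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class FetchAndParseDemo {
        public static void main(String[] args) throws Exception {
            String url = "http://example.com/"; // placeholder
            // httpclient does the network I/O...
            CloseableHttpClient client = HttpClients.createDefault();
            CloseableHttpResponse resp = client.execute(new HttpGet(url));
            String html = EntityUtils.toString(resp.getEntity(), "UTF-8");
            resp.close();
            client.close();
            // ...and jsoup turns the raw HTML into a queryable DOM
            Document doc = Jsoup.parse(html, url); // second arg = base URI for abs: resolution
            for (Element a : doc.select("a[href]")) {
                System.out.println(a.attr("abs:href"));
            }
        }
    }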



III. Code Implementation


The three data structures described above:

    // visited URLs  <URL, isAccessed>  (despite the name, this is the visited map, not a queue)
    final static ConcurrentHashMap<String, Boolean> urlQueue = new ConcurrentHashMap<String, Boolean>();
    // URLs waiting to be fetched
    final static ConcurrentLinkedDeque<String> urlWaitingQueue = new ConcurrentLinkedDeque<String>();
    // URLs waiting to be scanned for out-links
    final static ConcurrentLinkedDeque<String> urlWaitingScanQueue = new ConcurrentLinkedDeque<String>();
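
A note on the visited map: putIfAbsent is atomic, so checking and marking a URL can be a single call instead of a separate containsKey plus put (markVisited is a hypothetical helper, not part of the original code):

    // returns true only for the first caller to mark this URL, so the
    // check-and-mark is one atomic step across threads
    private static boolean markVisited(String url) {
        return urlQueue.putIfAbsent(url, Boolean.TRUE) == null;
    }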


Enqueueing URLs:

    /**
     * Poll a URL off the scan queue and enqueue its out-links.
     * NOTE: the originalUrl parameter is currently unused; the URL to scan
     * is polled from urlWaitingScanQueue instead.
     * @param originalUrl
     * @throws Exception
     */
    private static void enterWaitingQueue(final String originalUrl) throws Exception {
        String url = urlWaitingScanQueue.poll();
        // if accessed, ignore the url
        /*while (urlQueue.containsKey(url)) {
            url = urlWaitingQueue.poll();
        }*/
        final String finalUrl = url;
        Thread.sleep(600); // crude throttle between scans
        new Thread(new Runnable() {
            public void run() {
                try {
                    if (finalUrl != null) {
                        Connection conn = Jsoup.connect(finalUrl);
                        Document doc = conn.get();
                        //urlQueue.putIfAbsent(finalUrl, Boolean.TRUE); // accessed
                        logger.info("scanning page URL: " + finalUrl);
                        Elements links = doc.select("a[href]");
                        for (int linkNum = 0; linkNum < links.size(); linkNum++) {
                            Element element = links.get(linkNum);
                            String suburl = element.attr("href");
                            // subject to filtering conditions, and not visited before
                            if (!urlQueue.containsKey(suburl)) {
                                urlWaitingScanQueue.offer(suburl);
                                urlWaitingQueue.offer(suburl);
                                logger.info("URL queued " + linkNum + ": " + suburl);
                            }
                        }
                    }
                } catch (Exception ee) {
                    logger.error("multi-thread executing error, url: " + finalUrl, ee);
                }
            }
        }).start();
    }
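
One caveat in the scan loop above: element.attr("href") returns the raw attribute, so relative links like /about will later fail as standalone requests. jsoup can resolve them against the page's base URI (which Jsoup.connect() sets automatically) via the abs: prefix:

    String suburl = element.attr("abs:href"); // resolves relative hrefs to absolute URLs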


Fetching pages:

    private static void viewPages() throws Exception {
        Thread.sleep(500);
        new Thread(new Runnable() {
            @Override
            public void run() {
                try {
                    while (!urlWaitingQueue.isEmpty()) {
                        // poll() (not peek()) so the URL is removed from the queue;
                        // peek() would spin forever on the same URL
                        final String finalUrl = urlWaitingQueue.poll();
                        // build a client, like opening a browser
                        CloseableHttpClient httpClient = HttpClients.createDefault();
                        // create the request, like typing a URL into the browser;
                        // for plain crawling a GET (see the commented line) usually fits better
                        //HttpGet httpGet = new HttpGet("http://www.dxy.cn");
                        HttpPost httpPost = new HttpPost(finalUrl);
                        StringBuffer stringBuffer = new StringBuffer();
                        HttpResponse response;
                        //List<NameValuePair> keyValue = new ArrayList<NameValuePair>();
                        // POST parameters
                        //keyValue.add(new BasicNameValuePair("username", "zhu"));
                        //httpPost.setEntity(new UrlEncodedFormEntity(keyValue, "UTF-8"));
                        // execute the request and get the response
                        response = httpClient.execute(httpPost);
                        // record the URL as visited
                        urlQueue.putIfAbsent(finalUrl, Boolean.TRUE);
                        if (response.getStatusLine().getStatusCode() == HttpStatus.SC_OK) {
                            HttpEntity httpEntity = response.getEntity();
                            if (httpEntity != null) {
                                logger.info("viewPages visiting URL: " + finalUrl);
                                BufferedReader reader = new BufferedReader(
                                        new InputStreamReader(httpEntity.getContent(), "UTF-8"));
                                String line = null;
                                // NOTE: responses with chunked encoding report length -1
                                // and are skipped by this check
                                if (httpEntity.getContentLength() > 0) {
                                    stringBuffer = new StringBuffer((int) httpEntity.getContentLength());
                                    while ((line = reader.readLine()) != null) {
                                        stringBuffer.append(line);
                                    }
                                    System.out.println(finalUrl + " content: " + stringBuffer);
                                }
                                reader.close();
                            }
                        }
                        httpClient.close();
                    }
                } catch (Exception e) {
                    logger.error("view pages error", e);
                }
            }
        }).start();
    }
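
The two methods are never shown being wired together, so here is a hedged guess at a driver; the seed URL is a placeholder, and the loop condition is naive (the scan queue can look empty while scanner threads are still running):

    public static void main(String[] args) throws Exception {
        String seed = "http://www.example.com/"; // placeholder seed
        urlWaitingScanQueue.offer(seed);
        urlWaitingQueue.offer(seed);
        viewPages(); // one consumer thread draining the fetch queue
        while (!urlWaitingScanQueue.isEmpty()) {
            // each call polls one URL off the scan queue and scans it in a
            // new thread; the parameter is unused by the current implementation
            enterWaitingQueue(seed);
        }
    }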



IV. Summary and Future Work


The above covers the core modules of a bare-bones Java crawler; it is essentially ready to pick up and test.


Features mentioned in the preface, such as crawl-rate control (a scheduling module) and proxy-IP access (a proxy-collection module), will be added gradually in later versions...
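
As a rough sketch of where those modules could hook in (nothing below exists in the current code; the proxy host, port, and delays are placeholders, and HttpClient 4.3+ is assumed for RequestConfig):

    import java.util.Random;
    import org.apache.http.HttpHost;
    import org.apache.http.client.config.RequestConfig;
    import org.apache.http.client.methods.HttpPost;

    // route one request through a proxy and pause a randomized interval
    // beforehand, to keep the crawl rate polite
    static void configureRequest(HttpPost httpPost) throws InterruptedException {
        RequestConfig config = RequestConfig.custom()
                .setProxy(new HttpHost("127.0.0.1", 8888)) // placeholder proxy
                .setConnectTimeout(5000)
                .setSocketTimeout(5000)
                .build();
        httpPost.setConfig(config);
        Thread.sleep(1000 + new Random().nextInt(2000)); // 1-3s randomized delay
    }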




