WebMagic Crawler Examples
Source: Internet  Editor: 程序博客网  Date: 2024/05/22 14:24
Use Maven to import the following two packages:
<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-core</artifactId>
    <version>0.5.2</version>
</dependency>
<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-extension</artifactId>
    <version>0.5.2</version>
</dependency>
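This time I put together two small examples, both of which crawl novel sites. The first is the Qidian list page.
Inspecting the page with Firebug shows that the books sit in a ul element with class "all-img-list cf", one li per book.
WebMagic's annotation mode is enough for this, and it is simple and convenient: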
package com.zab.webmagic;

import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.model.ConsolePageModelPipeline;
import us.codecraft.webmagic.model.OOSpider;
import us.codecraft.webmagic.model.annotation.ExtractBy;
import us.codecraft.webmagic.model.annotation.TargetUrl;

// Model for one book entry on the Qidian list page.
// The class-level @ExtractBy selects every <li> of the list; multi = true yields one object per <li>.
@TargetUrl("http://a.qidian.com/")
@ExtractBy(value = "//ul[@class=\"all-img-list cf\"]/li", multi = true)
public class GithubRepoPageProcessor {

    // The field-level rules below are applied inside each matched <li>.
    @ExtractBy("//div[@class=book-mid-info]/h4/a/text()")
    private String title;

    @ExtractBy("//div[@class=book-mid-info]/p[@class=author]/a[@class=name]/text()")
    private String author;

    @ExtractBy("//div[@class=book-mid-info]/p[@class=author]/a[@class=go-sub-type]/text()")
    private String type;

    @ExtractBy("//div[@class=book-mid-info]/p[@class=author]/span/text()")
    private String status;

    @ExtractBy("//div[@class=book-mid-info]/p[@class=intro]/text()")
    private String intro;

    @ExtractBy("//div[@class=book-mid-info]/p[@class=update]/span/text()")
    private String count;

    public static void main(String[] args) {
        // OOSpider.create(Site.me(), new ConsolePageModelPipeline(), Qidian.class).addUrl("http://a.qidian.com/").thread(4).run();
        // Fetch the single list page; ConsolePageModelPipeline prints each extracted object.
        OOSpider ooSpider = OOSpider.create(Site.me().setSleepTime(100), new ConsolePageModelPipeline(), GithubRepoPageProcessor.class);
        GithubRepoPageProcessor qidian = ooSpider.get("http://a.qidian.com/");
        System.out.println(qidian);
    }
}
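To make the two-level extraction in the annotated class easier to follow, here is a rough sketch of the same idea using only the JDK's javax.xml.xpath, without WebMagic. The markup in SAMPLE is a hypothetical, well-formed stand-in for the list, not the real Qidian page, and the class and method names are made up for illustration. Note that standard XPath needs quoted attribute values, unlike the unquoted form used in the annotations above.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class RowExtractionSketch {

    // Hypothetical stand-in for the list markup (assumption, not the real page).
    static final String SAMPLE =
          "<ul class=\"all-img-list cf\">"
        + "<li><div class=\"book-mid-info\"><h4><a>Book A</a></h4>"
        + "<p class=\"author\"><a class=\"name\">Author A</a></p></div></li>"
        + "<li><div class=\"book-mid-info\"><h4><a>Book B</a></h4>"
        + "<p class=\"author\"><a class=\"name\">Author B</a></p></div></li>"
        + "</ul>";

    /** Returns one "title / author" string per li row. */
    static List<String> extractRows(String markup) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(markup)));
        XPath xpath = XPathFactory.newInstance().newXPath();

        // Step 1: select the rows, like the class-level rule with multi = true.
        NodeList rows = (NodeList) xpath.evaluate(
                "//ul[@class='all-img-list cf']/li", doc, XPathConstants.NODESET);

        List<String> result = new ArrayList<>();
        for (int i = 0; i < rows.getLength(); i++) {
            Node row = rows.item(i);
            // Step 2: evaluate the field rules relative to the current row (leading dot).
            String title = xpath.evaluate(
                    ".//div[@class='book-mid-info']/h4/a/text()", row);
            String author = xpath.evaluate(
                    ".//div[@class='book-mid-info']/p[@class='author']/a[@class='name']/text()", row);
            result.add(title + " / " + author);
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // Prints "Book A / Author A" and then "Book B / Author B".
        extractRows(SAMPLE).forEach(System.out::println);
    }
}
```

The point is only the shape of the work: one selector picks the rows, and the per-field selectors run inside each row.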
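At this point we already have the results.
The second example is the Chuangshi list page.
Inspecting it with Firebug in the same way shows that the books are rows of a table inside the div with class "leftlist".
The code is: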
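One thing worth knowing about the final System.out.println call: the model class does not override toString, so that line falls back to Object.toString and prints only the class name and a hash. As far as I can tell, the readable field values come from ConsolePageModelPipeline. If you want the println itself to be readable, a toString override is enough. The sketch below uses a hypothetical BookItem class with just two fields and a made-up helper named of; in the real model you would add the toString method to the existing class.

```java
class BookItem {
    private String title;
    private String author;

    // Hypothetical helper for the demo; WebMagic itself fills the fields via the annotations.
    static BookItem of(String title, String author) {
        BookItem item = new BookItem();
        item.title = title;
        item.author = author;
        return item;
    }

    // Without this override, println would show something like BookItem@1b6d3586.
    @Override
    public String toString() {
        return "BookItem{title='" + title + "', author='" + author + "'}";
    }

    public static void main(String[] args) {
        // Prints BookItem{title='Book A', author='Author A'}
        System.out.println(BookItem.of("Book A", "Author A"));
    }
}
```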
package com.zab.webmagic;

import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.model.ConsolePageModelPipeline;
import us.codecraft.webmagic.model.OOSpider;
import us.codecraft.webmagic.model.annotation.ExtractBy;
import us.codecraft.webmagic.model.annotation.TargetUrl;

// Model for one book row on the Chuangshi list page.
// The class-level @ExtractBy selects every table row; multi = true yields one object per row.
@TargetUrl("http://chuangshi.qq.com/bk/")
@ExtractBy(value = "//div[@class='leftlist']/table/tbody/tr", multi = true)
public class ChuangShi {

    // The field-level rules below are applied inside each matched <tr>.
    @ExtractBy("//a[@class=green]/text()")
    private String title;

    @ExtractBy("//a[@class=grey3]/text()")
    private String author;

    @ExtractBy("//a[@class=grey2]/text()")
    private String type;

    public static void main(String[] args) {
        // OOSpider.create(Site.me(), new ConsolePageModelPipeline(), Qidian.class).addUrl("http://a.qidian.com/").thread(4).run();
        // The page charset is set explicitly to utf-8 for this site.
        OOSpider ooSpider = OOSpider.create(Site.me().setCharset("utf-8"), new ConsolePageModelPipeline(), ChuangShi.class);
        ChuangShi qidian = ooSpider.get("http://chuangshi.qq.com/bk/");
        System.out.println(qidian);
    }
}
Running it prints the extracted results to the console.
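The second example passes setCharset("utf-8") to Site. I assume this is there so that the page bytes are decoded as UTF-8 rather than with some other charset. The following JDK-only sketch shows why that matters for Chinese text: the same bytes decoded with the wrong charset no longer match the original string. The class name CharsetSketch and its decode helper are made up for illustration, and ISO-8859-1 is used only as a convenient wrong charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetSketch {

    // Decodes raw bytes with the given charset, as a downloader has to do with a response body.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // Two Chinese characters written as Unicode escapes so the source file encoding does not matter.
        String original = "\u521b\u4e16";
        byte[] bytes = original.getBytes(StandardCharsets.UTF_8);

        // Decoding with the charset the bytes were written in restores the text: prints true.
        System.out.println(decode(bytes, StandardCharsets.UTF_8).equals(original));

        // Decoding with a different charset yields garbled text: prints false.
        System.out.println(decode(bytes, StandardCharsets.ISO_8859_1).equals(original));
    }
}
```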