Java crawler: scraping book information
Source: Internet  Editor: 程序博客网  Date: 2024/06/05 07:13
This program scrapes information about Java books from JD.com (京东).
Book model:
private String bookID;
private String bookName;
private String bookPrice;
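The other classes call getBookID(), setBookName() and so on, so Book is presumably an ordinary JavaBean around these three fields. The post does not show the accessors; the following is only a minimal sketch of what the class could look like, with the accessor names taken from the calls made elsewhere in the code.

```java
// Minimal sketch of the Book model. The accessors are inferred from how
// the other classes use Book; they are not shown in the original post.
class Book {
    private String bookID;    // product id, read from the data-sku attribute
    private String bookName;  // title text
    private String bookPrice; // price, kept as the scraped text

    public String getBookID() { return bookID; }
    public void setBookID(String bookID) { this.bookID = bookID; }

    public String getBookName() { return bookName; }
    public void setBookName(String bookName) { this.bookName = bookName; }

    public String getBookPrice() { return bookPrice; }
    public void setBookPrice(String bookPrice) { this.bookPrice = bookPrice; }
}
```

All three fields stay as String because the parser stores the scraped text unchanged.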
File structure
1) httpclient Maven configuration (the way an HttpClient is created differs between versions):
<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
    <version>4.1.2</version>
</dependency>
2) main method (fetches the data and stores it):
import java.util.List;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.http.client.HttpClient;
import org.apache.http.impl.client.DefaultHttpClient;

public class bookMain {
    static final Log logger = LogFactory.getLog(bookMain.class); // logging via log4j

    public static void main(String[] args) throws Exception {
        HttpClient httpclient = new DefaultHttpClient(); // create the HttpClient
        String url = "https://search.jd.com/Search?keyword=java&enc=utf-8&wq=java&pvid=f961dczi.8r5joc"; // seed URL
        List<Book> books = URLEntity.URLParse(httpclient, url); // fetch the page and parse it via URLEntity
        for (Book book : books) {
            logger.info("bookId:" + book.getBookID() + "\t"
                    + "bookName:" + book.getBookName() + "\t"
                    + "bookPrice:" + book.getBookPrice() + "\t");
        }
        mysql_control.executeInsert(books); // insert the data into the database
    }
}
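The seed URL above is hard-coded for the keyword java. To search for another keyword, the query string has to be URL-encoded. The helper below is a hypothetical sketch, not part of the original program: buildSearchUrl is an invented name, and it keeps only the keyword, enc and wq parameters of the seed URL. Whether the site also needs the pvid parameter has not been checked here.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class SeedUrlSketch {
    // Hypothetical helper: builds a search URL for an arbitrary keyword.
    // The pvid parameter of the original seed URL is deliberately left out (assumption).
    static String buildSearchUrl(String keyword) {
        try {
            String encoded = URLEncoder.encode(keyword, "UTF-8"); // a space becomes '+'
            return "https://search.jd.com/Search?keyword=" + encoded
                    + "&enc=utf-8&wq=" + encoded;
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is guaranteed to exist on every JVM, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // prints https://search.jd.com/Search?keyword=java&enc=utf-8&wq=java
        System.out.println(buildSearchUrl("java"));
    }
}
```

The returned string can be passed to URLEntity.URLParse in place of the fixed url.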
3) Getting the response (the httpUtil class)
import java.io.IOException;
import org.apache.http.HttpResponse;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;

public class httpUtil {
    public static HttpResponse getHtml(HttpClient httpclient, String url) throws IOException {
        HttpGet getMethod = new HttpGet(url); // GET request
        return httpclient.execute(getMethod); // execute the GET and return its response
    }
}
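4) Returning the information in the entity (the URLEntity class)
It calls 3) to obtain the response.
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.http.HttpResponse;
import org.apache.http.HttpStatus;
import org.apache.http.client.HttpClient;
import org.apache.http.util.EntityUtils;

public class URLEntity {
    public static List<Book> URLParse(HttpClient httpclient, String url) throws IOException {
        List<Book> getbooks = new ArrayList<Book>();
        HttpResponse response = httpUtil.getHtml(httpclient, url);
        int statusCode = response.getStatusLine().getStatusCode(); // read the status code
        if (statusCode == HttpStatus.SC_OK) { // 200 means success
            String entity = EntityUtils.toString(response.getEntity(), "utf-8");
            getbooks = bookParse.getData(entity);
        }
        EntityUtils.consume(response.getEntity()); // the entity must be consumed at the end in either case
        return getbooks;
    }
}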
5) Parsing the HTML (jsoup is used here): the bookParse class
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class bookParse {
    public static List<Book> getData(String html) {
        List<Book> datas = new ArrayList<Book>();
        Document doc = Jsoup.parse(html);
        // every search result is one li.gl-item inside the result list
        Elements elements = doc.select("ul[class=gl-warp clearfix]").select("li[class=gl-item]");
        for (Element element : elements) {
            String bookid = element.select("div[class=gl-i-wrap j-sku-item]").attr("data-sku");
            String bookprice = element.select("div[class=p-price]").select("strong").select("i").text();
            String bookname = element.select("div[class=p-name]").select("em").text();
            Book book = new Book();
            book.setBookID(bookid);
            book.setBookName(bookname);
            book.setBookPrice(bookprice);
            datas.add(book);
        }
        return datas;
    }
}
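When a selector matches nothing, jsoup's attr() and text() return an empty string rather than null, so a list item with a different layout would produce a Book with blank fields. The sketch below shows one way such records could be dropped before they are stored. dropIncomplete is a hypothetical helper that is not in the original program, and the nested Book class is only a stand-in that keeps the example self-contained.

```java
import java.util.ArrayList;
import java.util.List;

class BookFilterSketch {
    // Stand-in for the Book model, reduced to the fields the filter looks at.
    static class Book {
        final String bookID;
        final String bookName;

        Book(String bookID, String bookName) {
            this.bookID = bookID;
            this.bookName = bookName;
        }
    }

    // Hypothetical helper: keeps only the books whose id and name are non-blank.
    static List<Book> dropIncomplete(List<Book> books) {
        List<Book> kept = new ArrayList<Book>();
        for (Book book : books) {
            if (isPresent(book.bookID) && isPresent(book.bookName)) {
                kept.add(book);
            }
        }
        return kept;
    }

    // True when the string is neither null nor made up of whitespace only.
    static boolean isPresent(String s) {
        return s != null && !s.trim().isEmpty();
    }
}
```

In the real program the same check would use getBookID() and getBookName() and could run just before the database insert.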
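6) Database connection
The mysql_source class
import javax.sql.DataSource;
// BasicDataSource comes from Apache Commons DBCP
// (package org.apache.commons.dbcp or org.apache.commons.dbcp2, depending on the version used)

public class mysql_source {
    public static DataSource getDataSource(String connectURI) {
        BasicDataSource ds = new BasicDataSource();
        ds.setDriverClassName("com.mysql.jdbc.Driver");
        ds.setUsername("root");
        ds.setPassword("wodemima");
        ds.setUrl(connectURI);
        return ds;
    }
}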
The mysql_control class
import java.sql.SQLException;
import java.util.List;
import javax.sql.DataSource;
import org.apache.commons.dbutils.QueryRunner;

public class mysql_control {
    static DataSource ds = mysql_source.getDataSource("jdbc:mysql://127.0.0.1:3306/book");
    static QueryRunner qr = new QueryRunner(ds);

    public static void executeInsert(List<Book> bookdatas) throws SQLException {
        // one row per book, one column per ? placeholder
        Object[][] params = new Object[bookdatas.size()][3];
        for (int i = 0; i < params.length; i++) {
            params[i][0] = bookdatas.get(i).getBookID();
            params[i][1] = bookdatas.get(i).getBookName();
            params[i][2] = bookdatas.get(i).getBookPrice();
        }
        qr.batch("insert into books(bookID,bookName,bookPrice) values(?,?,?)", params);
        System.out.println("Inserted " + bookdatas.size() + " rows");
    }
}
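QueryRunner.batch runs the statement once for each row of the two-dimensional params array, and each row supplies the values for the ? placeholders in order. The sketch below isolates that conversion so the shape of the array can be inspected without a database. toBatchParams is a hypothetical name, and the nested Book class is a stand-in for the real model.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class BatchParamsSketch {
    // Stand-in for the Book model so that the example is self-contained.
    static class Book {
        private final String bookID;
        private final String bookName;
        private final String bookPrice;

        Book(String bookID, String bookName, String bookPrice) {
            this.bookID = bookID;
            this.bookName = bookName;
            this.bookPrice = bookPrice;
        }

        String getBookID() { return bookID; }
        String getBookName() { return bookName; }
        String getBookPrice() { return bookPrice; }
    }

    // Hypothetical helper: one row per book, columns in the order
    // bookID, bookName, bookPrice to match the insert statement.
    static Object[][] toBatchParams(List<Book> books) {
        Object[][] params = new Object[books.size()][3];
        for (int i = 0; i < params.length; i++) {
            Book book = books.get(i);
            params[i][0] = book.getBookID();
            params[i][1] = book.getBookName();
            params[i][2] = book.getBookPrice();
        }
        return params;
    }

    public static void main(String[] args) {
        List<Book> books = new ArrayList<Book>();
        books.add(new Book("12345", "Thinking in Java", "89.00"));
        // prints [[12345, Thinking in Java, 89.00]]
        System.out.println(Arrays.deepToString(toBatchParams(books)));
    }
}
```

An empty list gives an array with zero rows, so no statement would be executed for it.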
7) Supplement
Contents of the log4j.properties file
log4j.rootLogger=DEBUG, stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%-5p - %m%n
This configuration only logs to the console.
Console output of a run