Writing a Web Crawler to Fetch Ele.me Merchant Information (Part 1)

Source: Internet. Edited by: 程序博客网. Time: 2024/04/28 20:12

We fetch the data with two tools in turn: HttpClient and Jsoup.

Maven coordinates:

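<dependency>
    <groupId>commons-httpclient</groupId>
    <artifactId>commons-httpclient</artifactId>
    <version>3.1</version>
</dependency>
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.10.2</version>
    <!-- Default (compile) scope: the code below imports org.jsoup classes directly,
         so a runtime-only scope would keep it off the compile classpath. -->
</dependency>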


The page to crawl:



Use Google Chrome's developer tools to monitor the network traffic.



The data shown on the front end turns out to come from JSON returned by the back end, so we only need to request the data URL directly.

URL: https://www.ele.me/restapi/shopping/restaurants?extras%5B%5D=activities&geohash=wsb0ujx0pu4&latitude=26.88082&limit=24&longitude=112.68573&offset=0&terminal=web
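The query string above can also be assembled from its individual parameters rather than hard-coded. The following is a minimal sketch using only the JDK. The class and method names (`RestaurantUrlBuilder`, `build`) are hypothetical, and treating `limit` and `offset` as page size and page start is an assumption based on the parameter names, not something the API response confirms.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not from the original post): rebuilds the request URL
// from its query parameters.
class RestaurantUrlBuilder {

    static final String BASE = "https://www.ele.me/restapi/shopping/restaurants";

    // Assumption: limit is the page size and offset is the index of the first result.
    static String build(String geohash, String latitude, String longitude,
                        int limit, int offset) {
        // LinkedHashMap keeps the parameters in the same order as the captured URL.
        Map<String, String> params = new LinkedHashMap<>();
        params.put("extras[]", "activities");
        params.put("geohash", geohash);
        params.put("latitude", latitude);
        params.put("limit", String.valueOf(limit));
        params.put("longitude", longitude);
        params.put("offset", String.valueOf(offset));
        params.put("terminal", "web");

        StringBuilder sb = new StringBuilder(BASE);
        char separator = '?';
        try {
            for (Map.Entry<String, String> e : params.entrySet()) {
                // URLEncoder turns "extras[]" into "extras%5B%5D".
                sb.append(separator)
                  .append(URLEncoder.encode(e.getKey(), "UTF-8"))
                  .append('=')
                  .append(URLEncoder.encode(e.getValue(), "UTF-8"));
                separator = '&';
            }
        } catch (UnsupportedEncodingException ex) {
            // UTF-8 is always available on the JVM, so this should not happen.
            throw new IllegalStateException(ex);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // First call reproduces the URL captured in Chrome; second requests the next 24 entries.
        System.out.println(build("wsb0ujx0pu4", "26.88082", "112.68573", 24, 0));
        System.out.println(build("wsb0ujx0pu4", "26.88082", "112.68573", 24, 24));
    }
}
```

If the pagination assumption holds, stepping `offset` by `limit` would walk through the full merchant list one page at a time.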

Opening the URL in the browser shows the JSON as garbled text. Now let's make the request:

HttpClient:

package com.yc.elm.utils;

import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.HttpMethod;
import org.apache.commons.httpclient.methods.GetMethod;

public class GetDate {
    public static void main(String[] args) throws Exception {
        String url = "https://www.ele.me/restapi/shopping/restaurants"
                + "?extras%5B%5D=activities&geohash=wsb0ujx0pu4&latitude=26.88082"
                + "&limit=24&longitude=112.68573&offset=0&terminal=web";
        // Create the client
        HttpClient client = new HttpClient();
        HttpMethod method = new GetMethod(url);
        try {
            client.executeMethod(method);
            byte[] bytes = method.getResponseBody();
            // Decode the raw response bytes as UTF-8
            String json = new String(bytes, "utf-8");
            System.out.println(json);
        } finally {
            // Release the connection once the body has been read
            method.releaseConnection();
        }
    }
}
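The explicit UTF-8 decoding step in the code above is what keeps Chinese merchant names readable. As a small illustration using only the JDK, the sketch below decodes the same bytes with the right and the wrong charset. The class name `CharsetDecodeDemo` and the sample JSON string are made up for this example and are not real API output.

```java
import java.nio.charset.StandardCharsets;

// Illustration only: shows why the response bytes are decoded explicitly as UTF-8.
class CharsetDecodeDemo {

    // Same idea as `new String(bytes, "utf-8")` in the HttpClient example.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Decoding with a single-byte charset splits each multi-byte character into garbage.
    static String decodeLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Hypothetical sample payload; \u997f\u4e86\u4e48 is the Chinese name of Ele.me.
        String original = "{\"name\":\"\u997f\u4e86\u4e48\"}";
        byte[] bytes = original.getBytes(StandardCharsets.UTF_8);

        System.out.println(decodeUtf8(bytes).equals(original));   // prints true
        System.out.println(decodeLatin1(bytes).equals(original)); // prints false
    }
}
```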


Result:



Jsoup:

package com.yc.elm.utils;

import org.jsoup.Connection;
import org.jsoup.Connection.Response;
import org.jsoup.Jsoup;

public class GetDate {
    public static void main(String[] args) throws Exception {
        String url = "https://www.ele.me/restapi/shopping/restaurants?"
                + "extras%5B%5D=activities&geohash=wsb0ujqse46&latitude=26.88021&limit=24&"
                + "longitude=112.68484&offset=0&terminal=web";
        Connection con = Jsoup.connect(url);
        Response response = con.execute();
        System.out.println(response.body());
    }
}

This raises an error:

Exception in thread "main" org.jsoup.UnsupportedMimeTypeException: Unhandled content type. Must be text/*, application/xml, or application/xhtml+xml. Mimetype=application/json, URL=https://www.ele.me/restapi/shopping/restaurants?extras%255B%255D=activities&geohash=wsb0ujqse46&latitude=26.88021&limit=24&longitude=112.68484&offset=0&terminal=web
    at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:689)
    at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:628)
    at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:260)
    at com.yc.elm.utils.GetDate.main(GetDate.java:14)

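This happens because of the response's content type. By default, Jsoup only accepts text/*, application/xml, and application/xhtml+xml responses, and this endpoint returns application/json. So here we call .ignoreContentType(true) to tell Jsoup to skip the content-type check.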
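To make the rule in the error message concrete, here is a simplified sketch of such a content-type check. It is written purely from the wording of the exception above and is not Jsoup's actual source; the class and method names are hypothetical.

```java
import java.util.Locale;

// Simplified model of the check described by the exception message.
// Assumption: this mirrors the message text only, not Jsoup's real implementation.
class ContentTypeCheckSketch {

    static boolean acceptedByDefault(String contentType) {
        // Drop parameters such as "; charset=utf-8" and normalize case.
        String mime = contentType.split(";")[0].trim().toLowerCase(Locale.ROOT);
        return mime.startsWith("text/")
                || mime.equals("application/xml")
                || mime.equals("application/xhtml+xml");
    }

    public static void main(String[] args) {
        System.out.println(acceptedByDefault("text/html; charset=utf-8")); // prints true
        System.out.println(acceptedByDefault("application/json"));         // prints false
    }
}
```

Under this model, the application/json response from the restaurants endpoint is rejected unless the check is switched off, which is what the corrected Jsoup call does: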

package com.yc.elm.utils;

import org.jsoup.Connection;
import org.jsoup.Connection.Response;
import org.jsoup.Jsoup;

public class GetDate {
    public static void main(String[] args) throws Exception {
        String url = "https://www.ele.me/restapi/shopping/restaurants?"
                + "extras%5B%5D=activities&geohash=wsb0ujqse46&latitude=26.88021&limit=24&"
                + "longitude=112.68484&offset=0&terminal=web";
        Connection con = Jsoup.connect(url).ignoreContentType(true);
        Response response = con.execute();
        System.out.println(response.body());
    }
}

Result:



We now have the data. The next step is simply to parse it with a JSON library; see the next post for details.

