Scraping Fang.com (房天下) listing data with Java

Source: Internet. Editor: 程序博客网. Time: 2024/04/26 03:32


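Author's note: this post is for learning and discussion only. If you use it for any other purpose on your own, you bear the consequences.

Requests are simulated with HttpClient 4.5, and pages are parsed with jsoup.

I. Analyze the page and decide which data to scrape

As on the page shown in the figure below, we need the URL of every result page, plus the details of each listing and the information of the corresponding agent.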

         

II. Based on this, create the following classes:

import java.io.Serializable;

/**
 * Listing information
 * @author yangJun
 */
public class HouseInfo implements Serializable {
    private static final long serialVersionUID = 1L;
    /** Title */
    private String title;
    /** Total price */
    private String finalPrice;
    /** Reference down payment */
    private String referPay;
    /** Reference monthly payment */
    private String referMonth;
    /** Layout (rooms) */
    private String apartmentLayout;
    /** Floor area */
    private String buildMeasureOfArea;
    /** Usable area */
    private String useMeasureOfArea;
    /** Year built */
    private String years;
    /** Orientation */
    private String orientation;
    /** Floor */
    private String floor;
    /** Structure */
    private String structure;
    /** Renovation */
    private String renovation;
    /** Residential category */
    private String residCategory;
    /** Building category */
    private String architCategory;
    /** Type of property right */
    private String propertyRight;
    /** Development name */
    private String propertyName;
    /** Facilities */
    private String suppFacil;
    /** Listing description */
    private String describe;
    /** Contact phone */
    private String phone;
    /** Contact name */
    private String persionName;
    /** Address */
    private String address;
    /** Transport */
    private String traffic;
    /** Source URL */
    private String houserFromUrl;
    /** Update time */
    private String updateTime;
    /** Crawl time */
    private String climbingTime;
    // ..................... (remaining members elided in the original)
}
import java.io.Serializable;

/**
 * Information about the corresponding agent
 * @author yangJun
 */
public class ContactsInfo implements Serializable {
    private static final long serialVersionUID = 1L;
    /** Name */
    private String name;
    /** Positive-review rate */
    private String rateOfPraise;
    /** Authenticity score */
    private String truthDegree;
    /** Satisfaction score */
    private String satisDegree;
    /** Professionalism score */
    private String professDegree;
    /** Phone */
    private String phone;
    /** URL of the agent's detail page */
    private String detailedInfoUrl;
    /** Avatar URL */
    private String photoUrl;
    // ............... (remaining members elided in the original)
}
import java.io.Serializable;

/**
 * Information about the corresponding residential community
 * @author yangJun
 */
public class CommunityInfo implements Serializable {
    private static final long serialVersionUID = 1L;
    /** URL of the development's detail page */
    private String propertyInfoUrl;
    /** Development name */
    private String propertyName;
    /** Second-hand listings */
    private String secondHandHousing;
    /** Rentals */
    private String rental;
    /** Property type */
    private String propertyType;
    /** Greening rate */
    private String greeningRate;
    /** Property management fee */
    private String propertyFee;
    /** Property management company */
    private String propertyCompany;
    /** Developer */
    private String developers;
    /** Average price of this development for the current month */
    private String averagePrice;
    /** Change versus last month */
    private String thanLastMonth;
    /** Change versus last year */
    private String yearOverYear;
}
import java.io.Serializable;

/**
 * Wrapper object
 * @author yangJun
 */
public class SummaryModel implements Serializable {
    private static final long serialVersionUID = 1L;
    /** Community information */
    private CommunityInfo comm;
    /** Agent information */
    private ContactsInfo contactsInfo;
    /** Listing information */
    private HouseInfo houseInfo;
    /** Search criteria */
    // (the original listing breaks off here)
}

III. Analyze how the page returns its data. The content is not generated dynamically, so here we only need to fetch the returned HTML page and parse it.

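1. Starting from the root URL, collect the list of URLs that can be crawled. This experiment uses second-hand housing listings, with the city set to Xi'an.

Root URL: http://esf.xian.fang.com/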

/**
 * Collect the crawlable page URLs, starting from the root URL.
 * @param map
 * @param url
 * @return
 */
public static Map<String, String> getAllUrl(Map<String, String> map, String url) {
    String res = BaseGetPage.reqHttpGet(url);
    Document html = null;
    if (!"".equals(res)) {
        html = Jsoup.parse(res);
        String nextUrl = "";
        Element elementById = html.getElementById("list_D10_15");
        // Current page, wrapped as "共N页" so it can be compared with the total below
        String now = "共" + elementById.getElementsByClass("pageNow").html() + "页";
        // Total number of pages
        String allTxt = elementById.select("span").html();
        elementById.getElementsByClass("pageNow").remove();
        Elements elementsByTag = elementById.getElementsByTag("a");
        for (int i = 0; i < elementsByTag.size(); i++) {
            if (elementsByTag.get(i).hasAttr("id")) {
                elementsByTag.get(i).remove();
            } else {
                map.put(ClimbingModel.URL_FangTianxia + elementsByTag.get(i).attr("href"),
                        elementsByTag.get(i).html());
                nextUrl = elementsByTag.get(i).attr("href");
            }
            if (i == (elementsByTag.size() - 1)) {
                // Last pager link reached: fetch the next batch of page URLs
                if (allTxt.equals(now)) {
                    break;
                } else {
                    getAllUrl(map, ClimbingModel.URL_FangTianxia + nextUrl);
                }
            }
        }
    } else {
        System.out.println("Unknown error! Unable to fetch the page");
    }
    return map;
}
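getAllUrl follows the pager by calling itself each time it reaches the last link on a page. As an alternative, the same walk can be written as a loop with a set of visited URLs, which also guards against fetching a page twice. The sketch below is not the author's code: PageWalker, collectPageUrls, the linksOf callback and the maxPages limit are hypothetical names, and unlike getAllUrl it returns only the URLs, not the page labels. The jsoup extraction of the pager links would live inside linksOf.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

class PageWalker {
    /**
     * Walks pager links iteratively instead of recursively.
     * linksOf is a hypothetical callback: given a page URL, it returns the
     * pager URLs found on that page (for example via jsoup).
     * maxPages is a safety limit so a broken pager cannot loop forever.
     */
    static Set<String> collectPageUrls(String startUrl,
                                       Function<String, List<String>> linksOf,
                                       int maxPages) {
        Set<String> seen = new LinkedHashSet<>();   // keeps discovery order
        Deque<String> queue = new ArrayDeque<>();
        queue.add(startUrl);
        while (!queue.isEmpty() && seen.size() < maxPages) {
            String url = queue.poll();
            if (!seen.add(url)) {
                continue;                            // already visited
            }
            for (String link : linksOf.apply(url)) {
                if (!seen.contains(link)) {
                    queue.add(link);
                }
            }
        }
        return seen;
    }
}
```

Because the page fetch is passed in as a function, the walk can be tried against an in-memory map of fake pages before it is pointed at the real site.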

BaseGetPage.reqHttpGet(url), called in getAllUrl, fetches the HTML page. Its code was given in an earlier post, so it is not repeated here.
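Since that helper is not shown in this post, here is a minimal stand-in for readers who want to run the later snippets. It is an assumption, not the author's BaseGetPage: it uses the JDK's built-in java.net.http.HttpClient (Java 11+) instead of Apache HttpClient 4.5, the class name SimpleGetPage, the charset parameter and the User-Agent value are made up for illustration, and the only contract it keeps is the one the callers rely on, namely that an empty string means the fetch failed.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.Charset;
import java.time.Duration;

class SimpleGetPage {
    private static final HttpClient CLIENT = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(10))
            .followRedirects(HttpClient.Redirect.NORMAL)
            .build();

    /**
     * Fetches a page and decodes the body with the given charset.
     * Returns "" on any failure, because the callers test for "".
     */
    static String reqHttpGet(String url, Charset charset) {
        try {
            HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                    .timeout(Duration.ofSeconds(20))
                    .header("User-Agent", "Mozilla/5.0") // illustrative value
                    .GET()
                    .build();
            HttpResponse<String> response =
                    CLIENT.send(request, HttpResponse.BodyHandlers.ofString(charset));
            return response.statusCode() == 200 ? response.body() : "";
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();       // preserve the interrupt flag
            return "";
        } catch (Exception e) {
            // Malformed URL, I/O error, timeout: all reported as "no page"
            return "";
        }
    }
}
```

The charset is a parameter because the page has to be decoded with the encoding it actually declares; which encoding that is should be checked against the response rather than assumed.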

The extraction logic follows the HTML shown in the figure below.

The figure below shows the result of running the code.


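2. We now have the page addresses. Next we iterate over them and, for each page URL, extract the agent URL, the update time, and the URL of the listing's detail page.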

public static Map<String, String> getNowUrlTime(String url) {
    if (url.equals("")) {
        url = ClimbingModel.URL_FangTianxia;
    }
    String res = BaseGetPage.reqHttpGet(url);
    Document html = null;
    Map<String, String> map = new HashMap<String, String>();
    if (!"".equals(res)) {
        html = Jsoup.parse(res);
        Elements elementsByClass = html.getElementsByClass("houseList");
        if (elementsByClass.size() > 0) {
            Elements elementsByClass2 = elementsByClass.get(0).getElementsByTag("dl");
            for (Element element : elementsByClass2) {
                String houseInfoUrl = element.select("a").attr("href");
                String ContactsInfoUrl = element.getElementsByClass("gray6").select("a").attr("href");
                String updateTime = element.getElementsByClass("gray6").select("span").html();
                // Key: listing URL + ";" + agent URL. Value: update time.
                map.put(ClimbingModel.URL_FangTianxia + houseInfoUrl + ";"
                        + ClimbingModel.URL_FangTianxia + ContactsInfoUrl, updateTime);
            }
        } else {
            System.out.println("Please check: the page was not fetched correctly!");
        }
    } else {
        System.out.println("Unknown error! Unable to fetch the page");
    }
    return map;
}
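Both methods build absolute URLs by concatenating ClimbingModel.URL_FangTianxia with the href. When the root ends with "/" and the href starts with "/", the result contains a double slash, as in the `//chushou/` URLs in the sample output further down. A small sketch of joining the two with java.net.URI instead; UrlJoin and resolve are illustrative names, not part of the original code.

```java
import java.net.URI;

class UrlJoin {
    /** Resolves an href found on a page against the page's base URL. */
    static String resolve(String base, String href) {
        return URI.create(base).resolve(href).toString();
    }

    public static void main(String[] args) {
        // Absolute-path href: replaces the path of the base URL
        System.out.println(resolve("http://esf.xian.fang.com/", "/chushou/3_153467676.htm"));
        // Relative href: appended to the base directory
        System.out.println(resolve("http://esf.xian.fang.com/", "a/gqt1511"));
    }
}
```

URI.create throws IllegalArgumentException for strings that are not valid URIs (for example an href containing a space), so real page data may need cleaning or a try/catch around the call.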


As the figure shows, the two parts separated by ";" are the listing URL and the agent URL.
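To use such a key later it has to be split back into its two URLs. The following is one possible way to do that, under the assumption that neither URL itself contains a ";". ListingKey is a hypothetical helper introduced only for this illustration.

```java
class ListingKey {
    final String houseUrl;
    final String agentUrl;

    ListingKey(String houseUrl, String agentUrl) {
        this.houseUrl = houseUrl;
        this.agentUrl = agentUrl;
    }

    /** Splits "listingUrl;agentUrl" at the first ';'. */
    static ListingKey parse(String key) {
        int i = key.indexOf(';');
        if (i < 0) {
            throw new IllegalArgumentException("missing ';' in key: " + key);
        }
        return new ListingKey(key.substring(0, i), key.substring(i + 1));
    }
}
```

When looping over the map returned by getNowUrlTime, each entry's key can be passed through ListingKey.parse, and houseUrl together with the entry's value (the update time) can then be handed to the detail-page step.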

3. Finally, fetch the detailed information.

/**
 * From a listing URL, fetch the listing details and community information,
 * plus the agent's basic information and URL.
 * @return
 */
public static SummaryModel getHouseInfo(String url, String updateTime) {
    // String url = ClimbingModel.URL_FangTianxia + "/chushou/3_153474871.htm";
    SummaryModel sumarModel = new SummaryModel();
    HouseInfo house = new HouseInfo();
    CommunityInfo comm = new CommunityInfo();
    ContactsInfo contact = new ContactsInfo();
    String res = BaseGetPage.reqHttpGet(url);
    house.setUpdateTime(updateTime);
    Document html = null;
    if (!"".equals(res)) {
        html = Jsoup.parse(res);
        house.setHouserFromUrl(url);
        house.setTitle(html.getElementsByTag("title").html());
        Elements elementsByClass = html.getElementsByClass("inforTxt");
        if (elementsByClass.size() <= 0) {
            System.out.println("No data");
            // Return early: the original went on to call get(0) on the empty list
            return sumarModel;
        }
        Elements elementsByClass1 = elementsByClass.get(0).getElementsByTag("dd");
        for (Element element : elementsByClass1) {
            if (element.hasAttr("style")) {
                element.remove();
            } else {
                element.getElementsByClass("padl27").remove();
                String s = element.select("span").html();
                element.select("span").remove();
                String s1 = element.html();
                if (s1.contains("面积")) {
                    // Area fields: strip the undecodable character and append the unit
                    if (s1.contains("使用")) {
                        house.setUseMeasureOfArea(s1 + s.replaceAll("�", "") + "㎡");
                    } else {
                        house.setBuildMeasureOfArea(s1 + s.replaceAll("�", "") + "㎡");
                    }
                } else {
                    if (s.contains("参考首付")) {
                        house.setReferPay(s + s1);
                    } else if (s.contains("户型")) {
                        house.setApartmentLayout(s + s1);
                    } else if (s.contains("年代")) {
                        house.setYears(s + s1);
                    } else if (s.contains("朝向")) {
                        house.setOrientation(s + s1);
                    } else if (s.contains("楼层")) {
                        house.setFloor(s + s1);
                    } else if (s.contains("结构")) {
                        house.setStructure(s + s1);
                    } else if (s.contains("装修")) {
                        house.setRenovation(s + s1);
                    } else if (s.contains("类别")) {
                        house.setResidCategory(s + s1);
                    } else if (s.contains("产权")) {
                        house.setPropertyRight(s + s1);
                    }
                }
            }
        }
        Elements elementsByClass2 = elementsByClass.get(0).getElementsByTag("dt");
        for (int i = 0; i < elementsByClass2.size(); i++) {
            Element dt = elementsByClass2.get(i);
            if (i == 0) {
                // Total price
                dt.getElementsByTag("a").remove();
                house.setFinalPrice(dt.select("span").html());
            } else if (i == 1) {
                dt.getElementsByTag("a").last().remove();
                // Development name
                house.setPropertyName(dt.select("span").html() + dt.select("a").first().html());
                dt.select("a").first().remove();
                house.setAddress(dt.select("a").html());
            } else if (i == 2) {
                // Facilities
                house.setSuppFacil(dt.select("span").html());
            }
        }
        Element elementById = html.getElementById("hsPro-pos");
        Elements elementsByTag2 = elementById.getElementsByTag("div");
        elementsByTag2.last().remove();
        if (elementById.select("div").hasAttr("style")) {
            // Description
            System.out.println(elementById.select("div").last().html());
            house.setDescribe(elementById.select("div").last().html());
        }
        Element elementById2 = html.getElementById("hsMap-pos");
        Elements elementsByTag = elementById2.getElementsByTag("p");
        for (int i = 0; i < elementsByTag.size(); i++) {
            if (i == 0) {
                // More detailed address
                elementsByTag.get(i).getElementsByClass("pad127").remove();
                elementsByTag.get(i).select("span").remove();
                house.setAddress(house.getAddress() + "-" + elementsByTag.get(i).html());
            } else if (i == 1) {
                // Transport
                System.out.println(elementsByTag.get(i).select("span").html());
                elementsByTag.get(i).select("span").remove();
                house.setTraffic(elementsByTag.get(i).html());
            }
        }
        SimpleDateFormat si = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
        house.setClimbingTime(si.format(new Date()));

        /* Community information */
        Element elementById3 = html.getElementById("hsmPro-pos");
        Elements elementsByClass3 = elementById3.getElementsByTag("dl");
        for (int i = 0; i < elementsByClass3.size(); i++) {
            if (i == 0) {
                Elements elementsByTag3 = elementsByClass3.get(i).getElementsByTag("dt");
                comm.setPropertyInfoUrl(elementsByTag3.get(0).select("a").attr("href"));
                elementsByTag3.get(0).select("a").remove();
                String s = elementsByTag3.get(0).select("span").html();
                elementsByTag3.get(0).select("span").remove();
                comm.setPropertyName(s + elementsByTag3.get(0).html().replaceAll("&nbsp;", ""));
                Elements elementsByTag4 = elementsByClass3.get(i).getElementsByTag("dd");
                for (Element element : elementsByTag4) {
                    String html2 = element.select("span").html();
                    element.select("span").remove();
                    String html4 = element.select("a").html();
                    element.select("a").remove();
                    String html3 = element.html();
                    if (html2.contains("二手房")) {
                        comm.setSecondHandHousing(html2 + html4 + html3);
                    } else if (html2.contains("租房")) {
                        comm.setRental(html2 + html4 + html3);
                    } else if (html2.contains("物业类型")) {
                        comm.setPropertyType(html2 + html4 + html3);
                    } else if (html2.contains("绿化率")) {
                        comm.setGreeningRate(html2 + html4 + html3);
                    } else if (html2.contains("物业费")) {
                        comm.setPropertyFee(html2 + html4 + html3);
                    } else if (html2.contains("物业公司")) {
                        comm.setPropertyCompany(html2 + html4 + html3);
                    } else if (html2.contains("开发商")) {
                        comm.setDevelopers(html2 + html4 + html3);
                    }
                }
            } else if (i == 1) {
                Elements elementsByTag4 = elementsByClass3.get(i).getElementsByTag("dt");
                String html2 = elementsByTag4.select("span").html();
                elementsByTag4.select("span").remove();
                elementsByTag4.select("a").remove();
                comm.setAveragePrice(html2 + elementsByTag4.html());
                Elements elementsByTag5 = elementsByClass3.get(i).getElementsByTag("dd");
                for (Element element : elementsByTag5) {
                    String html3 = element.getElementsByTag("span").last().html();
                    element.select("span").remove();
                    if (element.html().contains("环比上月")) {
                        comm.setThanLastMonth(element.html() + html3);
                    } else if (element.html().contains("同比去年")) {
                        comm.setYearOverYear(element.html() + html3);
                    }
                }
            }
        }

        /* Contact (agent) information */
        Elements elementsByClass4 = html.getElementsByClass("leftBox");
        Elements elementsByTag3 = elementsByClass4.get(5).getElementsByTag("dl");
        if (elementsByTag3.size() > 0) {
            Elements elementsByTag4 = elementsByTag3.get(0).getElementsByTag("dd");
            for (Element element : elementsByTag4) {
                String html2 = element.select("span").html();
                element.select("span").remove();
                if (element.html().contains("好评率")) {
                    contact.setRateOfPraise(element.html() + html2);
                } else if (element.html().contains("真实度")) {
                    contact.setTruthDegree(element.html() + html2);
                } else if (element.html().contains("满意度")) {
                    contact.setSatisDegree(element.html() + html2);
                } else if (element.html().contains("专业度")) {
                    contact.setProfessDegree(element.html() + html2);
                }
                System.out.println(element.html() + html2);
            }
        }
        if (elementsByTag3.size() >= 2) {
            Element element = elementsByTag3.get(1);
            Elements elementsByTag5 = element.getElementsByTag("img");
            Element lastLink = element.select("a").last();
            contact.setName(lastLink.html());
            // Note: this appends the link text, not its href, which is why
            // detailedInfoUrl in the sample output ends with the agent's name.
            contact.setDetailedInfoUrl(ClimbingModel.URL_FangTianxia + lastLink.html());
            contact.setPhotoUrl(elementsByTag5.get(0).attr("src"));
        }
        sumarModel.setHouseInfo(house);
        sumarModel.setComm(comm);
        sumarModel.setContactsInfo(contact);
    } else {
        System.out.println("Unknown error! Unable to fetch the page");
    }
    return sumarModel;
}
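The long if/else chains in getHouseInfo all do the same thing: find which label a row carries and call the matching setter. One way to shorten them, sketched here as an alternative and not as the author's approach, is a lookup table from label fragment to setter. LabelDispatch and its nested House class are hypothetical stand-ins that only mirror three fields of HouseInfo.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BiConsumer;

class LabelDispatch {
    /** Minimal stand-in for the article's HouseInfo; only three fields are shown. */
    static class House {
        String apartmentLayout;
        String years;
        String orientation;
    }

    // Label fragment -> setter. LinkedHashMap keeps insertion order,
    // so when several fragments match, the earliest entry wins.
    private static final Map<String, BiConsumer<House, String>> SETTERS = new LinkedHashMap<>();
    static {
        SETTERS.put("户型", (h, v) -> h.apartmentLayout = v);
        SETTERS.put("年代", (h, v) -> h.years = v);
        SETTERS.put("朝向", (h, v) -> h.orientation = v);
    }

    /**
     * Stores label + value in the first field whose fragment occurs in the label.
     * Returns false when no fragment matches, so unknown rows can be logged.
     */
    static boolean apply(House house, String label, String value) {
        for (Map.Entry<String, BiConsumer<House, String>> e : SETTERS.entrySet()) {
            if (label.contains(e.getKey())) {
                e.getValue().accept(house, label + value);
                return true;
            }
        }
        return false;
    }
}
```

With a table like this, separate entries for "住宅类别" and "建筑类别" could route the two category rows to different fields; in getHouseInfo both match "类别" and land in residCategory, which is consistent with architCategory being null in the sample output.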

Output:

SummaryModel [comm=CommunityInfo [propertyInfoUrl=http://sijihuayuanjq.fang.com/, propertyName=楼盘名称:金桥四季花园( 城南 电子城 ) , secondHandHousing=null, rental=null, propertyType=物业类型:住宅, greeningRate=null, propertyFee=null, propertyCompany=物业公司:鸿兴物业, developers=null, averagePrice=金桥四季花园本月均价:

6244元/平方米, thanLastMonth=环比上月:↓1.17%, yearOverYear=同比去年:↑3.76%], contactsInfo=ContactsInfo [name=张鹏云, rateOfPraise=好评率:0%, truthDegree=真实度:, satisDegree=满意度:, professDegree=专业度:, phone=null, detailedInfoUrl=http://esf.xian.fang.com/张鹏云, photoUrl=http://img1.soufunimg.com/usercenter/2016_06/12/10/avatar/120_164661735_0.jpg], houseInfo=HouseInfo [title=拎包入住 南北通透 大3居 中等楼层 惊爆价出售!,西安城南电子城金桥四季花园二手房三室 - 房天下, finalPrice=88

万, referPay=参考首付:

26.4万, referMonth=null, apartmentLayout=户型:3室2厅2厨2卫, buildMeasureOfArea=建筑面积:138㎡, useMeasureOfArea=null, years=年代:2004年, orientation=朝向:南北, floor=楼层:中层(共12层), structure=null, renovation=装修:精装修, residCategory=建筑类别:板楼, architCategory=null, propertyRight=产权性质:个人产权, propertyName=楼盘名称:金桥四季花园, suppFacil=配套设施:

水,煤气/天然气, describe=自我介绍:房天下张鹏云将竭诚为您服务!

<br />户型介绍:核心卖点:1、户型南北朝向,客厅落地阳台,主卧景观飘窗,面朝小区,景观好,采光好,位置安静;

<br />2、装修房东装修准备自己用的房子,不用再次装修,特别省心;

<br />3、产权70年全产权有证房,有证可按揭,交易全程专业律师及金融专员追踪,让您放心购房,可随时签约过户、可办银行按揭贷款;

<br />小区介绍:4、房子精装修3居1梯2户采光特别好

<br />其他:房源优势

<br />证满两年,全明结构,业主急卖,价格实惠;

<br />核心卖点:电子城精装三室88万南北通透中间楼层急售, phone=null, persionName=null, address=城南

电子城-电子城步行街南800米, traffic=南山门站:204、225、506、 526、 706、 716, houserFromUrl=http://esf.xian.fang.com//chushou/3_153359421.htm, updateTime=房天下自营

1分钟前更新, climbingTime=2016-12-26 10:44:52], searchCriteri=null, nextUrl=null, count=null]

http://esf.xian.fang.com//chushou/3_153467676.htm;http://esf.xian.fang.com//a/gqt1511>>1分钟前更新


The code is fairly rough, so please bear with it.

