Scraping 5i5j (我爱我家) rental listings with Python

Source: Internet  Editor: 程序博客网  Time: 2024/04/30 21:54

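Main idea:

1. Send the request headers and cookies to the server with a GET request. The cookies are copied from Chrome after logging in on the site, which gives the effect of a logged-in session without the tedious work of simulating a login with a username and password.
2. Fetch each page and extract the required fields with regular expressions.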
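
As a rough illustration of point 1, the sketch below builds (but does not send) a GET request that carries a User-Agent and a Cookie header, and splits a copied Cookie string into name/value pairs. The helper names build_request and parse_cookie_header are made up for this example, and it uses the standard library's urllib rather than the requests package used in the full script.

```python
from urllib.request import Request


def build_request(url, user_agent, cookie_header):
    """Build a GET request with browser-like headers (nothing is sent here)."""
    headers = {"User-Agent": user_agent, "Cookie": cookie_header}
    return Request(url, headers=headers, method="GET")


def parse_cookie_header(cookie_header):
    """Split a 'name=value; name2=value2' string into a dict."""
    cookies = {}
    for pair in cookie_header.split(";"):
        pair = pair.strip()
        if not pair:
            continue
        # partition() splits on the first '=' only, so values that
        # themselves contain '=' are kept whole.
        name, _, value = pair.partition("=")
        cookies[name] = value
    return cookies
```

Splitting on the first '=' matters in practice, because analytics cookies copied from a browser often contain further '=' characters inside the value.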

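The algorithm has three steps:

1. GET pages 1 through 100 of the rental listings, where the URL of each page is url_1='http://bj.5i5j.com/rent/n%d', and collect the listing IDs found on each page;
2. Use each listing ID to open that listing's page and scrape its price, layout, area, orientation, floor, and community name (columns '价格','户型','面积','朝向','楼层','小区名称');
3. Save the scraped listing information to 5i5j_house_info.xlsx.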
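
Steps 1 and 2 can be tried offline before touching the site. The sketch below reuses the URL template and the listing-ID pattern from this post, but the SAMPLE_HTML snippet and the helper names (page_urls, extract_house_ids, detail_url) are invented for illustration. The real page markup may differ.

```python
import re

URL_TEMPLATE = "http://bj.5i5j.com/rent/n%d"          # page URL pattern from the post
HOUSE_ID_RE = re.compile(r'<a href="/rent/(\d{9})"')  # 9-digit listing ID


def page_urls(first, last):
    """Return the listing-page URLs for pages first..last (inclusive)."""
    return [URL_TEMPLATE % n for n in range(first, last + 1)]


def extract_house_ids(html):
    """Return the set of listing IDs linked from one listing page."""
    return set(HOUSE_ID_RE.findall(html))


def detail_url(house_id):
    """Return the URL of a single listing's detail page."""
    return "http://bj.5i5j.com/rent/" + house_id


# Made-up HTML that only imitates the link format the pattern expects.
SAMPLE_HTML = '''
<li><a href="/rent/166475545" target="_blank">listing A</a></li>
<li><a href="/rent/166968941" target="_blank">listing B</a></li>
<li><a href="/rent/166475545" target="_blank">listing A again</a></li>
<li><a href="/rent/123">too short, ignored</a></li>
'''

if __name__ == "__main__":
    print(page_urls(1, 3))
    print(sorted(extract_house_ids(SAMPLE_HTML)))
```

Collecting the IDs in a set removes listings that appear more than once, which is why the full script also stores them in a set.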

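To be improved:

1. Fetching each page spends a long time waiting on I/O, so a later version will put the listing IDs in a queue and use a lock plus multiple threads to speed up the crawler.

2. 5i5j showed no anti-scraping measures during this crawl. On sites that do have them, proxy IPs, multiple browsers, or multiple accounts can be used.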
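
The queue-plus-threads idea from item 1 might look roughly like the sketch below. It is not the author's implementation: crawl_concurrently and its fetch parameter are hypothetical names, and fetch stands in for whatever function downloads and parses one listing (get_house_inf in the full script).

```python
import queue
import threading


def crawl_concurrently(house_ids, fetch, num_workers=4):
    """Run fetch(house_id) for every ID using a shared queue and worker threads.

    Returns (results, errors): a dict mapping ID -> fetched info, and a list
    of IDs whose fetch raised an exception.
    """
    tasks = queue.Queue()
    for house_id in house_ids:
        tasks.put(house_id)

    results = {}
    errors = []
    lock = threading.Lock()  # guards results and errors

    def worker():
        while True:
            try:
                # The queue is fully populated before the workers start,
                # so an empty queue means there is no work left.
                house_id = tasks.get_nowait()
            except queue.Empty:
                return
            try:
                info = fetch(house_id)
            except Exception:
                with lock:
                    errors.append(house_id)
            else:
                with lock:
                    results[house_id] = info

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results, errors
```

Because the work is dominated by network waits, threads can overlap those waits even under the GIL. The number of workers is a guess to tune, and a polite crawler would keep it small.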

The results are as follows (the screenshot is not preserved):


The full code is as follows:

# get_5i5j_house_info
import re

import requests
from openpyxl import Workbook

url_1 = 'http://bj.5i5j.com/rent/n%d'

# Request headers copied from Chrome after logging in; they simulate a logged-in session.
# The Cookie line is shortened here (the long analytics cookies are left out):
# paste the full Cookie header from your own logged-in session.
httphead = '''Accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
Accept-Encoding:gzip, deflate
Accept-Language:zh-CN,zh;q=0.8
Cache-Control:max-age=0
Connection:keep-alive
Cookie:suid=8715039984; BIGipServer=3647539722.20480.0000; PHPSESSID=a2a4l07dfeejokb2k2u1hgt7c6; domain=bj
Host:bj.5i5j.com
Referer:http://bj.5i5j.com/rent
Upgrade-Insecure-Requests:1
User-Agent:Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'''


# Parse the raw header text into a headers dict and a cookies dict
def get_head_cookies(httphead):
    head = {}
    for i in httphead.strip().split('\n'):
        # Split on the first ':' only, so values such as the Referer URL stay intact
        name, value = i.strip().split(':', 1)
        head[name] = value.strip()
    # Cookies are passed to requests separately, so take them out of the headers
    cookie = head.pop('Cookie')
    cookies = {}
    for i in cookie.strip().split(';'):
        if not i.strip():
            continue
        # Split on the first '=' only, because cookie values may contain '='
        name, value = i.strip().split('=', 1)
        cookies[name] = value
    return head, cookies


head, cookies = get_head_cookies(httphead)

# Collect the listing IDs (9 digits) from each listing page
house_id = r'<a href="/rent/(\d{9})"'


def get_house_ids(url):
    html = requests.get(url=url, headers=head, cookies=cookies).content.decode('utf-8')
    return set(re.findall(house_id, html))


house_ids = set()
page_url_error = []
# Pages 1 to 100
for i in range(1, 101):
    try:
        house_ids.update(get_house_ids(url_1 % i))
    except Exception:
        page_url_error.append(i)

# Scrape the details of each listing
house_inf = r'''<ul class="house-info">.+?"font-price">([\d.]+?)</span> 元/月.+?<b>户型:</b>(.+?)&.+?<b>面积:</b>(.+?)</li>.+?<b>朝向:</b>(.+?)</li>.+?<b>楼层:</b>(.+?)</li>.+?<b>小区:</b>(.+?)\s+.+?</li>.+?</ul>'''


def get_house_inf(house_id):
    html = requests.get(url='http://bj.5i5j.com/rent/' + house_id,
                        headers=head, cookies=cookies).content.decode('utf-8')
    # [0] raises IndexError when the page does not match; the caller records that ID
    return re.findall(house_inf, html, re.S | re.M)[0]


file = 'F:\\临时工作\\1023\\5i5j_house_info.xlsx'
wb_bj = Workbook()
ws_bj = wb_bj.worksheets[0]
ws_bj.title = '房源信息表'
# Each record looks like ('6800', '2室1厅1卫', '50.45平米', '南', '中部/13层', '崇文门西大街')
line_1 = ['价格', '户型', '面积', '朝向', '楼层', '小区名称']
ws_bj.append(line_1)
house_id_error = []
for house_id in house_ids:
    try:
        ws_bj.append(get_house_inf(house_id))
    except Exception:
        house_id_error.append(house_id)
wb_bj.save(file)
# Print the page numbers and listing IDs that failed
print(page_url_error, house_id_error)
 