A small RCurl application: scraping JD.com comments

Source: Internet · Published by: 全国网络培训中心 · Editor: 程序博客网 · Date: 2024/05/21 10:46

A small crawler built with the RCurl package, used to scrape the comments on an electric water heater listing on JD.com.

```r
# Use RCurl to scrape the comments for an electric water heater on JD.com
library(RCurl)
library(XML)
library(plyr)

# Target (JD) URLs for the data to scrape -- 56 pages in total
page <- 1:56
urlist <- paste("http://club.jd.com/allconsultations/1121567-", page, "-1.html", sep = "")

# Forge request headers
myheader <- c(
  "User-Agent" = "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9.1.6) ",
  "Accept" = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
  "Accept-Language" = "en-us",
  "Connection" = "keep-alive",
  "Accept-Charset" = "GB2312,utf-8;q=0.7,*;q=0.7"
)

# Download the pages
webpage <- getURL(urlist, httpheader = myheader, .encoding = "utf-8")

# Parse the HTML
pagetree <- htmlParse(webpage, encoding = "utf-8")

# Use XPath expressions to read the information in the target nodes
time <- xpathSApply(pagetree, "//div[@class='r_info']", xmlValue)
ask <- xpathSApply(pagetree, "//dl[@class='ask']/dd/a", xmlValue)

# Fix garbled Chinese characters
ask <- iconv(ask, "utf-8", "LATIN1")

# Minor data cleanup: keep only the trailing timestamp,
# and only the question text inside each anchor
time <- laply(time, function(x) {
  unlist(substring(x, nchar(x) - 18, nchar(x)))
})
ask <- laply(ask, function(x) {
  unlist(strsplit(x, "\n"))[2]
})
ask <- gsub(" ", "", ask)

# Assemble into a data frame
data <- data.frame("时间" = time, "内容" = ask)

# Export the data
write.csv(data, "评论数据.csv")
```
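The XPath extraction and string cleanup steps can be tried offline, without hitting JD.com. The snippet below is a minimal sketch: the inline HTML is a hand-made stand-in that only mimics the `r_info` and `ask` class names the crawler's XPath queries target; the content is invented for illustration.

```r
library(XML)

# Hand-made HTML imitating the node structure the XPath queries expect
# (class names match the crawler's queries; the content is made up)
html <- '<html><body>
<dl class="ask"><dd><a>q1
  Does it have a remote? </a></dd></dl>
<div class="r_info">answered 2014-05-27 11:09:20</div>
</body></html>'

pagetree <- htmlParse(html, encoding = "utf-8", asText = TRUE)

time <- xpathSApply(pagetree, "//div[@class='r_info']", xmlValue)
ask  <- xpathSApply(pagetree, "//dl[@class='ask']/dd/a", xmlValue)

# Same trimming as the crawler: a "YYYY-mm-dd HH:MM:SS" timestamp is
# 19 characters, so keep the last 19 characters of the node text
time <- substring(time, nchar(time) - 18, nchar(time))

# Keep the second line of the anchor text and strip spaces
ask <- gsub(" ", "", unlist(strsplit(ask, "\n"))[2])

print(time)  # "2014-05-27 11:09:20"
print(ask)   # "Doesithavearemote?"
```

This isolates the parsing logic from the network step, which makes it easier to adjust the XPath expressions if JD changes its page markup.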

The final scraped data looks like this:

                  时间                               内容
403 2014-05-27 11:09:20                           有线控么
404 2014-05-24 20:55:29             烧水一次断电保温多久?
405 2014-05-23 12:50:41                         几级能效的
406 2014-05-23 12:00:30                       几级能效的?
407 2014-05-20 14:48:49                   此款有木有防电墙
408 2014-05-13 09:54:47 热水器以后是京东负责维修还是海尔?

