Web Scraping in R: Crawling cnblogs (博客园) Posts



a). Load the R packages used in this case

## library packages needed in this case
library(proto)
library(gsubfn)
## Warning in doTryCatch(return(expr), name, parentenv, handler): unable to load shared object '/Library/Frameworks/R.framework/Resources/modules//R_X11.so':
##   dlopen(/Library/Frameworks/R.framework/Resources/modules//R_X11.so, 6): Library not loaded: /opt/X11/lib/libSM.6.dylib
##   Referenced from: /Library/Frameworks/R.framework/Resources/modules//R_X11.so
##   Reason: image not found
## Could not load tcltk.  Will use slower R code instead.
library(bitops)
library(rvest)
library(stringr)
library(DBI)
library(RSQLite)
library(sqldf)
library(RCurl)
library(ggplot2)
library(sp)
library(raster)
## Our machines usually run in a Chinese locale, but I want weekday names such as
## Monday and Tuesday, so an extra setting is needed to tell the system to use the
## English (North American) locale conventions.
Sys.setlocale("LC_TIME", "C")
## [1] "C"
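To make the effect of this locale switch concrete, here is a tiny check of my own (not part of the original post): after setting LC_TIME to "C", weekdays() on a date from the analysed window prints the English name that the later factor levels (Monday, Tuesday, ...) rely on.

## Quick illustration (my own addition): in a Chinese locale weekdays()
## would return names such as "星期一"; with LC_TIME set to "C" it returns
## the English name instead.
Sys.setlocale("LC_TIME", "C")
weekdays(as.Date("2015-03-02"))   # "Monday" -- the first day of the analysed window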

b). Define a custom function that will later be used to scrape the data.

A note from my own run: the URL prefix www.cnblogs.com/p must be entered exactly as in the original code; changing it to www.cnblogs.com produced errors.
As for the package-installation problem, simply repeating the installation made the package load successfully. The console output from that attempt follows:

The downloaded binary packages are in

/var/folders/mn/sbfbhyln5rdddk4rp42mflpw0000gn/T//RtmpQ3p1jK/downloaded_packages

Warning message:
In doTryCatch(return(expr), name, parentenv, handler) :
  unable to load shared object '/Library/Frameworks/R.framework/Resources/modules//R_X11.so':
  dlopen(/Library/Frameworks/R.framework/Resources/modules//R_X11.so, 6): Library not loaded: /opt/X11/lib/libSM.6.dylib
  Referenced from: /Library/Frameworks/R.framework/Resources/modules//R_X11.so
  Reason: image not found

> library(data.table)   # for the rbindlist function

data.table 1.10.4

  The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way

  Documentation: ?data.table, example(data.table) and browseVignettes("data.table")

  Release notes, videos and slides: http://r-datatable.com

> install.packages("data.table")

trying URL 'https://mirror.lzu.edu.cn/CRAN/bin/macosx/el-capitan/contrib/3.4/data.table_1.10.4.tgz'

Content type 'application/octet-stream' length 1436950 bytes (1.4 MB)

==================================================

downloaded 1.4 MB


## Create a function; the parameter 'i' means the page number.
getdata <- function(i){
    url <- paste0("www.cnblogs.com/p", i)  ## generate the url
    combined_info <- url %>% html_session() %>% html_nodes("div.post_item div.post_item_foot") %>% html_text() %>% strsplit(split="\r\n")
    post_date <- sapply(combined_info, function(v) return(v[3])) %>% str_sub(9, 24) %>% as.POSIXlt()  ## get the date
    post_year <- post_date$year + 1900
    post_month <- post_date$mon + 1
    post_day <- post_date$mday
    post_hour <- post_date$hour
    post_weekday <- weekdays(post_date)
    title <- url %>% html_session() %>% html_nodes("div.post_item h3") %>% html_text() %>% as.character() %>% trim()
    link <- url %>% html_session() %>% html_nodes("div.post_item a.titlelnk") %>% html_attr("href") %>% as.character()
    author <- url %>% html_session() %>% html_nodes("div.post_item a.lightblue") %>% html_text() %>% as.character() %>% trim()
    author_hp <- url %>% html_session() %>% html_nodes("div.post_item a.lightblue") %>% html_attr("href") %>% as.character()
    recommendation <- url %>% html_session() %>% html_nodes("div.post_item span.diggnum") %>% html_text() %>% trim() %>% as.numeric()
    article_view <- url %>% html_session() %>% html_nodes("div.post_item span.article_view") %>% html_text() %>% str_sub(4, 20)
    article_view <- gsub(")", "", article_view) %>% trim() %>% as.numeric()
    article_comment <- url %>% html_session() %>% html_nodes("div.post_item span.article_comment") %>% html_text() %>% str_sub(14, 100)
    article_comment <- gsub(")", "", article_comment) %>% trim() %>% as.numeric()
    data.frame(title, recommendation, article_view, article_comment, post_date, post_weekday,
               post_year, post_month, post_day, post_hour, link, author, author_hp)
}
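Before looping over many pages, it can help to smoke-test the function on a single page. This check is my own addition and assumes the listing layout described above (20 posts per page, 13 columns per row):

## Hypothetical smoke test (my addition): scrape page 1 only and inspect it.
page1 <- getdata(1)
dim(page1)                                             # expect 20 rows and 13 columns
head(page1[, c("title", "author", "post_date", "article_view")])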

c). Scrape the post-publishing data from cnblogs. Here I only scrape pages 1-5, which gives 100 records.

cnblog <- data.frame()
for(m in 1:5){
    cnblog <- rbind(cnblog, getdata(m))
}
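Since data.table was loaded earlier for rbindlist, an alternative sketch of my own is to collect the pages in a list and bind them once instead of growing the data frame with rbind(); the Sys.sleep() pause is also my addition, only there to keep the requests polite:

library(data.table)

pages <- lapply(1:5, function(m) {
    Sys.sleep(1)                      # pause one second between page requests (my addition)
    getdata(m)
})
cnblog <- rbindlist(pages)            # bind all pages at once
cnblog <- as.data.frame(cnblog)       # back to a plain data.frame for the sqldf/ggplot2 steps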

d). Take a look at the scraped data.

dim(cnblog)
## [1] 100  13
head(cnblog)
##                                                     title recommendation
## 1 Dynamic CRM 2015学习笔记(3)oData 查询方法及GUID值比较              0
## 2                                        Unity 之圆环算法              0
## 3                                        浅谈研发项目经理              1
## 4                                                C# Redis              0
## 5              JavaScript系列----AJAX机制详解以及跨域通信              0
## 6                                           MP4视频编码器              1
##   article_view article_comment           post_date post_weekday post_year
## 1            0               0 2015-04-10 20:46:00       Friday      2015
## 2           58               0 2015-04-10 19:57:00       Friday      2015
## 3          143               0 2015-04-10 19:38:00       Friday      2015
## 4          152               2 2015-04-10 19:25:00       Friday      2015
## 5           72               0 2015-04-10 19:14:00       Friday      2015
## 6           72               1 2015-04-10 19:14:00       Friday      2015
##   post_month post_day post_hour
## 1          4       10        20
## 2          4       10        19
## 3          4       10        19
## 4          4       10        19
## 5          4       10        19
## 6          4       10        19
##                                                    link     author
## 1       http://www.cnblogs.com/fengwenit/p/4415631.html     疯吻IT
## 2 http://www.cnblogs.com/wuzhang/p/wuzhang20150410.html    wuzhang
## 3        http://www.cnblogs.com/fancyamx/p/4415521.html     maxlin
## 4       http://www.cnblogs.com/caokai520/p/4409712.html   每日一bo
## 5     http://www.cnblogs.com/renlong0602/p/4414872.html 天天向上中
## 6         http://www.cnblogs.com/dhenskr/p/4414984.html    dhenskr
##                             author_hp
## 1   http://www.cnblogs.com/fengwenit/
## 2     http://www.cnblogs.com/wuzhang/
## 3    http://www.cnblogs.com/fancyamx/
## 4   http://www.cnblogs.com/caokai520/
## 5 http://www.cnblogs.com/renlong0602/
## 6     http://www.cnblogs.com/dhenskr/
tail(cnblog)
##                                  title recommendation article_view
## 95          前端资源预加载并展示进度条              3          560
## 96  Android中的Handler的机制与用法详解              1          213
## 97              JS学习笔记3_函数表达式              0          219
## 98                    iOS-MVVM设计模式              0          228
## 99             HTML5简单入门系列(七)              0          385
## 100  【Win 10应用开发】认识一下UAP项目              5          523
##     article_comment           post_date post_weekday post_year post_month
## 95                4 2015-04-08 18:03:00    Wednesday      2015          4
## 96                0 2015-04-08 18:02:00    Wednesday      2015          4
## 97                0 2015-04-08 17:56:00    Wednesday      2015          4
## 98                0 2015-04-08 17:47:00    Wednesday      2015          4
## 99                0 2015-04-08 17:36:00    Wednesday      2015          4
## 100               6 2015-04-08 17:31:00    Wednesday      2015          4
##     post_day post_hour
## 95         8        18
## 96         8        18
## 97         8        17
## 98         8        17
## 99         8        17
## 100        8        17
##                                                              link   author
## 95  http://www.cnblogs.com/lvdabao/p/resource-preload-plugin.html 每日一bo
## 96            http://www.cnblogs.com/JczmDeveloper/p/4403129.html   吕大豹
## 97                     http://www.cnblogs.com/ayqy/p/4403086.html Jamy Cai
## 98                    http://www.cnblogs.com/xqios/p/4403071.html     梦烬
## 99                   http://www.cnblogs.com/cotton/p/4403042.html   ciderX
## 100                 http://www.cnblogs.com/tcjiaan/p/4403018.html 棉花年度
##                                 author_hp
## 95      http://www.cnblogs.com/caokai520/
## 96        http://www.cnblogs.com/lvdabao/
## 97  http://www.cnblogs.com/JczmDeveloper/
## 98           http://www.cnblogs.com/ayqy/
## 99          http://www.cnblogs.com/xqios/
## 100        http://www.cnblogs.com/cotton/

e). Here I only look at the post data from the four weeks Mar.02-Mar.29; below, the data gets some simple processing.

## Here we only analyse the data from the four weeks in March.
cnblog_Mar <- sqldf("select * from cnblog where post_day>=2 and post_day<=29")
cnblog_Mar$post_weekday <- factor(cnblog_Mar$post_weekday, order=TRUE,
                                  levels=c("Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday"))
cnblog_Mar$post_hour <- as.factor(cnblog_Mar$post_hour)
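For readers who prefer not to pull in sqldf just for this filter, here is an equivalent base-R sketch of my own that produces the same cnblog_Mar data frame:

## Equivalent base-R version (my own sketch, not from the original post)
cnblog_Mar <- subset(cnblog, post_day >= 2 & post_day <= 29)
cnblog_Mar$post_weekday <- factor(cnblog_Mar$post_weekday, ordered = TRUE,
                                  levels = c("Monday","Tuesday","Wednesday","Thursday",
                                             "Friday","Saturday","Sunday"))
cnblog_Mar$post_hour <- as.factor(cnblog_Mar$post_hour)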

f). Simple data analysis: charts

Number of posts by weekday, Mar.02-Mar.29

ggplot(data=cnblog_Mar,aes(post_weekday))+geom_bar()

Distribution of the number of posts per day

ggplot(data=cnblog_Mar,aes(post_date))+geom_bar()
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
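The warning above only means ggplot2 picked a default bin width for the continuous post_date axis. If you want one bar per calendar day, you can set the width explicitly; this tweak is my own, and for a POSIXct variable the bin width is given in seconds:

## One bin per day: 60*60*24 = 86400 seconds (my own adjustment)
ggplot(data = cnblog_Mar, aes(post_date)) + geom_histogram(binwidth = 60*60*24)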

Distribution of posts by hour of publication

ggplot(data=cnblog_Mar,aes(post_hour))+geom_bar()


g). Summary

  • A crawler written in R like this has two parts: 1. define a function that captures the scraping logic for one page; 2. call that function in a loop and happily collect the data. In step 1, finding the right html_nodes is the key.
  • Using the Google Chrome browser together with CSS selectors makes it very convenient to locate the html_nodes.
  • Sites that keep their data in a table are even easier to scrape, for example the NBA 2014-2015 regular-season scoring leaderboard; when there is no table, you have to scrape field by field, as with cnblogs (see the sketch after this list).
  • The data of commercial sites is their treasure, for example Taobao, JD.com and Ctrip.
  • Next I plan to scrape some job-posting sites and analyse the salaries in the industries I am interested in, which companies are hiring, where the jobs are located, and what skills the positions require: R, Python, SAS, databases, and so on.
  • My first two crawlers are described in an earlier post on cnblogs; if you are interested, see "R语言网络爬虫学习 基于rvest包".
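To make the point about table-based pages concrete, here is a minimal sketch of my own using rvest's html_table(); the URL is a placeholder, so adjust the address and the table index to whatever stats page you actually scrape:

library(rvest)

## Hypothetical example (placeholder URL): any page whose data sits in an HTML <table>
page   <- read_html("http://example.com/nba-scoring-leaders")
tables <- page %>% html_nodes("table") %>% html_table(fill = TRUE)
scores <- tables[[1]]        # the first <table> on the page, already a data.frame
head(scores)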