Text Analysis in R (3)

Source: Internet  Editor: 程序博客网  Time: 2024/06/05 18:37

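I have been writing code along with the Text Mining section of "R and Data Mining: Examples and Case Studies". The earlier steps went fairly smoothly, and I was quietly pleased. From stem completion onward, however, the code starts to fail, and I have not yet found a fix.
Here is the code: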

    library(twitteR)
    load("rdmTweets.RData")
    df <- do.call("rbind", lapply(rdmTweets, as.data.frame))
    dim(df)
    head(df)

    library(tm)
    myCorpus <- Corpus(VectorSource(df$text))
    # convert to lower case
    myCorpus <- tm_map(myCorpus, content_transformer(tolower))
    # remove punctuation
    myCorpus <- tm_map(myCorpus, removePunctuation)
    # remove numbers
    myCorpus <- tm_map(myCorpus, removeNumbers)
    # remove URLs
    # (removeURL is not defined in the original listing; this definition is an assumption)
    removeURL <- function(x) gsub("http[[:alnum:]]*", "", x)
    myCorpus <- tm_map(myCorpus, content_transformer(removeURL))
    # my stop words
    myStopwords <- c(stopwords('english'), "available", "via")
    myStopwords <- setdiff(myStopwords, c("r", "big"))
    # remove stopwords
    myCorpus <- tm_map(myCorpus, removeWords, myStopwords)
    # keep a copy to use as the dictionary for stem completion
    myCorpusCopy <- myCorpus
    # stem words
    myCorpus <- tm_map(myCorpus, stemDocument)
    inspect(myCorpus[11:15])
    for (i in 11:15) {
      cat(paste("[[", i, "]]", sep = ""))
      writeLines(strwrap(myCorpus[[i]], width = 73))
    }
    # stem completion
    myCorpus <- tm_map(myCorpus, stemCompletion, dictionary=myCorpusCopy)

This is where it starts to fail. The message is:

    > myCorpus <- tm_map(myCorpus, stemCompletion, dictionary=myCorpusCopy)
    Warning message:
    In mclapply(content(x), FUN, ...) :
      all scheduled cores encountered errors in user code
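The warning only says that the function run by mclapply failed; it does not show the underlying error. One way to get at the real message is to call the function serially on each element and catch the condition. The sketch below is base R only. `first_error` is a hypothetical helper (not part of tm), and `failing_fun` is a made-up stand-in for whatever function is handed to tm_map; it is not meant to reproduce the actual cause of the warning above.

```r
# Hypothetical helper (not part of tm): call FUN on each element in turn and
# return the first error message, or NULL if every call succeeds.
first_error <- function(xs, FUN, ...) {
  for (i in seq_len(length(xs))) {
    x <- xs[[i]]
    msg <- tryCatch({
      FUN(x, ...)
      NULL                                  # no error for this element
    }, error = function(e) conditionMessage(e))
    if (!is.null(msg)) return(msg)
  }
  NULL
}

# Made-up stand-in for a transformation that fails when an argument is missing.
failing_fun <- function(x, dictionary) {
  if (missing(dictionary)) stop("argument \"dictionary\" is missing")
  toupper(x)
}

# Without the extra argument the real error message becomes visible.
print(first_error(c("mine", "use"), failing_fun))
# With it, every call succeeds and NULL is returned.
print(first_error(c("mine", "use"), failing_fun, dictionary = c("mining", "used")))
```

Running the same kind of loop over the documents of a corpus, with the real function in place of `failing_fun`, should show which call actually breaks.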

After turning to Stack Overflow, I found the following solutions:

  • Solution 1:
    Replace myCorpus <- tm_map(myCorpus, tolower) with myCorpus <- tm_map(myCorpus, content_transformer(tolower)).
    Result: no effect.
  • Solution 2:

    # Stem completion
    myCorpus <- tm_map(myCorpus, stemCompletion, dictionary = myCorpusCopy)

Replace the above with:

    # Stem completion
    # Note: dict is accepted but never passed on to stemCompletion() in this version.
    stemCompletion_mod <- function(x, dict) {
      PlainTextDocument(stripWhitespace(paste(
        stemCompletion(unlist(strsplit(as.character(x), " ")), type = "shortest"),
        sep = "", collapse = " ")))
    }
    # apply workaround function
    myCorpus <- lapply(myCorpus, stemCompletion_mod, myCorpusCopy)

Result: no effect.
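To make the idea behind Solution 2 concrete, here is a minimal base R sketch of per-document stem completion in which the dictionary really is passed through to the completion step. `complete_stems` and `complete_document` are hypothetical names. `complete_stems` is only a rough stand-in for tm's stemCompletion with type = "shortest": the dictionary here is a plain character vector rather than a corpus, and keeping an unmatched stem unchanged is my own choice, not necessarily what tm does.

```r
# Hypothetical stand-in for stem completion: complete each stem to the
# shortest dictionary word that starts with it.
complete_stems <- function(stems, dict) {
  vapply(stems, function(s) {
    hits <- dict[startsWith(dict, s)]
    if (length(hits) == 0) return(s)        # assumption: keep unmatched stems as-is
    hits[which.min(nchar(hits))]
  }, character(1), USE.NAMES = FALSE)
}

# Per-document wrapper in the spirit of stemCompletion_mod: split on spaces,
# drop empty tokens, complete each stem against dict, and paste back together.
complete_document <- function(text, dict) {
  stems <- unlist(strsplit(text, " ", fixed = TRUE))
  stems <- stems[stems != ""]
  paste(complete_stems(stems, dict), collapse = " ")
}

dict <- c("mining", "examples", "example", "analysis")
docs <- c("min  exampl", "r analysi")
print(vapply(docs, complete_document, character(1), dict = dict, USE.NAMES = FALSE))
# prints: [1] "mining example" "r analysis"
```

With tm itself, the corresponding step would be to give stemCompletion its dictionary argument inside the wrapper, for example dictionary = dict. Whether that alone removes the warning above is something I have not verified.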
