排序:智能收件箱

来源:互联网 发布:自学linux运维要多久 编辑:程序博客网 时间:2024/04/28 02:11

R语言机器学习实践之排序:智能收件箱


邮件的优先级划分需要综合多种因素,诸如社交特征(比如跟什么人来往的频率较高)、发件人发件的频率、邮件的类型(比如一些广告邮件,通常很少回复)、对邮件进行的动作(是否在短时间内回复)、邮件关键字、主题等等。本案例通过对邮件优先级特征的提取,建立邮件优先级排序算法,对邮件进行优先级排序。主要练习了特征提取的方法以及R语言的操作。

一、 特征分析

 一封邮件是否应该被推荐置顶,取决于邮件主人对该邮件的重视程度。如何从众多邮件中区分出邮件主人重视的邮件呢?邮件被重视的因素是多方面、复杂多变的,也许这封邮件来自于邮件主人关心的人,也许是邮件的内容是主人关心的,又或许该邮件是主人正在等待的某个问题的解决方案等等。对于特殊的因素,一般是比较难以预测的,但是一些规律性的因素,可以通过邮件主人的行为活动进行判别,比如某个人发的邮件,主人每次都在第一时间进行回复,这样我们就有理由认为这个人发的邮件是值得重视的;又比如垃圾邮件、广告邮件,很显然不应该拥有更高的优先级。对于这个案例的思路如图所示,可以通过训练样本数据提取出一些规律用于判断邮件的优先级。
这里写图片描述

二、数据整理及特征提取

 数据来自于SampAssassin公开语料库,这些邮件只能提取一些静态的特征,无法提供更详细的数据,因此,这只是一个练习。首先导入需要用到的包:

library(tm)   #需要进行文本分析library(ggplot2) #可视化library(plyr) #数据整合data.path = "E:/R/ML/data/classify/"easyham.path = paste(data.path,"easy_ham/",sep = "")

 首先需要一个通过路径读取文件、提取特征的总调用函数,这个函数依赖于其他的子函数:

parse.email<-function(path){  full.msg<-get.full(path)  #根据路径读取邮件的所有内容  from<-get.from(full.msg)  #获取邮件来源的地址  date<-get.date(full.msg)  #获取邮件接收日期  subj<-get.subj(full.msg)  #获取邮件主题  msg<-get.msg(full.msg)    #获取邮件正文  return(c(date,from,subj,msg,path)) #返回所有特征}

 parse.email函数通过传入的邮件路径对邮件进行解析,提取出特征并返回一个特征向量。以下是所有的提取特征的函数:

get.full<-function(path){  con <- file(path,open = "rt",encoding = "latin1")  msg <- readLines(con)  close(con)  return(msg)} #这个函数按行读取邮件内容get.from <- function(msg.vec){  from <- msg.vec[grepl("From:",msg.vec)]  from <- strsplit(from,'[:<> ]')[[1]]  from <- from[which(from!=""&from!=" ")]  return(from[grepl("@",from)])}  #函数解析:邮件中来源的格式一般为:From:yunfei&lt;yunfei@163.com>   #这种类型,首先从文本向量中找出含有From:的一行,由于可能正文   #中也存在From:开头的行,但是邮件来源通常在正文前面,所以匹配   #后选择第一个列表:strsplit[][[1]],之后再匹配出含有@的元素就行了get.msg <- function(msg.vec){  msg <- msg.vec[seq(which(msg.vec=="")[1]+1,length(msg.vec),1)]  msg <- paste(msg,collapse = "\n")  return(msg)} #正文一般为第一个空行后的文本get.subj <- function(msg.vec){  subj <- msg.vec[grepl("Subject: ",msg.vec)]  if(length(subj)>0){    return(strsplit(subj,"Subject: ")[[1]][2])  }  else{    return("")  }}  #get.subj 函数获取主题,有些邮件可能没有主题,返回空行即可get.date <- function(msg.vec){  date.grep <- grepl("^Date: ",msg.vec)  date.grepl <- which(date.grep == TRUE)  date <- msg.vec[date.grepl[1]]  date<- strsplit(date,"\\+|\\-|\\: ")[[1]][2] #通过对邮件的观察可以发现时间一般为这样的格式:Date: Th,22 Jan 2002 19:23:45 +5000  #或者 Date: 22 Oct 2002 12:22:11 -2300  date <- gsub("^\\s+|\\s+$","",date) #去掉开头和结尾处的空格 ^符号匹配开头字符,$符号匹配结尾字符  return(strtrim(date,25))} #get.date 函数是最麻烦的,因为邮件中的时间格式比较复杂  #首先我们知道时间在邮件里以Date: 这样的格式开头  #因此应该有一个grepl("^Date:",msg.vec)正则匹配,之后再对  #日期的行进行处理

 至此,以上函数提取的所有特征均为字符型向量

easyham.docs <- dir(easyham.path)easyham.docs <- easyham.docs[which(easyham.docs!="cmds")] #cmds文件不是邮件easyham.parse <- lapply(easyham.docs, function(p){  parse.email(paste(easyham.path,p,sep = ""))})easyham.matrix <- do.call(rbind,easyham.parse) #do.call函数可以将一个列表按行或者列整合成一个数组,一般与lapply函数一起使用allparse.df <- data.frame(easyham.matrix,stringsAsFactors = F)names(allparse.df) <- c("Date","From.Email","Subject","Message","Path")

 需要将时间变量转换为可以计算的类型,日期有两种可能的格式:星期,日,月,年,时分秒和日月年时分秒。

date.converter <- function(date,pattern1,pattern2){  pattern1.conver <- strptime(date,pattern1)  pattern2.conver <- strptime(date,pattern2)  pattern1.conver[is.na(pattern1.conver)] <-     pattern2.conver[is.na(pattern1.conver)]  return(pattern1.conver)}pattern1 <- "%a, %d %b %Y %H:%M:%S"pattern2 <- "%d %b %Y %H:%M:%S"Sys.setlocale(locale = "US") #需要将环境设置为英文环境,否则转换出错

 对数据类型进行整理,选择训练集,以便进行后续的计算

allparse.df$Date <- date.converter(allparse.df$Date,pattern1,pattern2)allparse.df$Subject <- tolower(allparse.df$Subject)allparse.df$From.Email <- tolower(allparse.df$From.Email)priority <- allparse.df[with(allparse.df,order(Date)),]priority.train <- priority[1:(round(nrow(priority)/2)),]

至此已经完成第一步:数据整理及特征提取,已经提取出的特征包括:邮件的来源(From.Email)、邮件接受日期(Date)、邮件主题(Subject)、正文等

三、计算各个特征的权重

 首先从最简单的From权重开始考虑。我们知道,假如在一堆邮件中,来自某个地方的邮件数量明显多于其他邮件,这可以说明这个地方的邮件值得重视(垃圾邮件例外)

from.weight <- ddply(priority.train[,-1],.(From.Email),summarise,Freq = length(Subject))  #由于时间戳格式会导致ddply出错,因此去掉那一列,或者改为字符串格式from.weight <- from.weight[with(from.weight,order(Freq)),]from.weight.ex <- subset(from.weight,Freq>6)head(from.weight)

 可以通过可视化更加直观地查看邮件来源地址的情况

from.plot <- ggplot(data = from.weight.ex)+  geom_rect(aes(xmin = 1:nrow(from.weight.ex)-0.5,                xmax = 1:nrow(from.weight.ex)+0.5,                ymin = 0,                ymax = Freq,                fill = "lightgrey",                color = "darkblue"))+  scale_x_continuous(breaks = 1:nrow(from.weight.ex),labels = from.weight.ex$From.Email)+  coord_flip()+  scale_fill_manual(values = c("lightgrey" = "lightgrey"), guide = "none") +  scale_color_manual(values = c("darkblue" = "darkblue"), guide = "none") +  ylab("Number of Emails Received (truncated at 6)") +  xlab("Sender Address") +  theme_bw() +  theme(axis.text.y = element_text(size = 5, hjust = 1))from.plotggsave(plot = from.plot,       filename = file.path("0011_from_scales.pdf"),       height = 4.8,       width = 7)

这里写图片描述

由图可以看出邮件数量差距较大,如果直接以数量为权重,将会出现倾斜,因此我们用对数变换平滑

from.weight <- transform(from.weight,weights = log(Freq+1))#这里之所以要+1,是因为可能有些Freq为0时,会出现负无穷这样极端的数据 #看看平滑后的数据分布fweight <- subset(from.weight,Freq>6)fromweight.plot <- ggplot(fweight)+  geom_rect(aes(xmin = 1:nrow(fweight)-0.5,                xmax = 1:nrow(fweight)+0.5,                ymin = 0,                ymax = weights,                fill = "lightgrey",                color = "darkblue"))+  scale_x_continuous(breaks = 1:nrow(fweight),labels = fweight$From.Email)+  coord_flip()+  scale_fill_manual(values = c("lightgrey" = "lightgrey"), guide = "none") +  scale_color_manual(values = c("darkblue" = "darkblue"), guide = "none") +  ylab("Weight") +  xlab("Sender Address") +  theme_bw() +  theme(axis.text.y = element_text(size = 5, hjust = 1))fromweight.plotggsave(plot = fromweight.plot,       filename = file.path("fromweight.pdf"),       height = 4.8,       width = 7)

这里写图片描述

邮件线程活跃度权重的计算

重点是识别线程:什么是线程?本案例数据中没有线程ID,不易识别,不过可以根据主题,我们默认拥有共同主题的邮件是某个线程的一部分,找出这些线程,分组,再查看这些线程之间的时间间隔,以及在单位时间内这些线程进行了多少活动,就知道活跃程度了

find.threads <- function(email.df){  response.threads <- strsplit(email.df$Subject,"re: ")  # re: 开头的主题为回复线程  is.threads <- sapply(response.threads, function(p){    ifelse(p[[1]][1]=="",TRUE,FALSE)  })  #判断线程  threads <- response.threads[is.threads]  senders <- email.df$From.Email[is.threads]  threads <- sapply(threads, function(t){    paste(t[2:length(t)],collapse = "re:")  })  return(cbind(senders,threads))} ### 找线程活动及发送邮件的地址 threads.matrix <- find.threads(priority.train) ##计算线程活跃度email.threads <- function(threads.matrix){  senders <- threads.matrix[,1]  senders.freq <- table(senders) #统计发送者的频数  #计算发送者权重  senders.matrix <- cbind(names(senders.freq),                          senders.freq,log(senders.freq+1))  senders.df <- data.frame(senders.matrix,stringsAsFactors = F)  row.names(senders.df) <- 1:nrow(senders.df)  names(senders.df) <- c("From.Email","Freq","Weight")  senders.df$Freq <- as.numeric(senders.df$Freq)  senders.df$Weight <- as.numeric(senders.df$Weight)  return(senders.df)}senders.df <- email.threads(threads.matrix)

根据邮件内容追加权重,即对每一属于线程的邮件,统计词频
首先应该想办法衡量某一线程的活跃度
可以通过在这一线程中收了多少封邮件进行衡量。

threads.count <- function(thread,email.df){  threads.times <- email.df$Date[which(    email.df$Subject == thread |     email.df$Subject == paste("re:",thread))]  freq <- length(threads.times)#计算该线程中邮件的数量  min.time <- min(threads.times)  max.time <- max(threads.times)  #线程的时间跨度  time.span <- as.numeric(difftime(max.time,min.time))  if(freq<2){    return(c(NA,NA,NA))#不活跃线程  }  else{    trans.weight <- freq/time.span #单位时间内收发的邮件数    log.trans.weight <- 10+log10(trans.weight)#计算对数权重    return(c(freq,time.span,log.trans.weight))  }} ###计算所有的线程活跃度权重get.threads <- function(threads.matrix,email.df){  #获取线程,去重  threads <- unique(threads.matrix[,2])  #对每一个线程求权重  threads.count <- lapply(threads, function(p) threads.count(p,email.df))  threads.weight <- do.call(rbind,threads.count)  return(cbind(threads,threads.weight))}threads.weight <- get.threads(threads.matrix,priority.train)threads.weight <- data.frame(threads.weight,stringsAsFactors = F)names(threads.weight) <- c("Thread","Freq","Response","Weight")threads.weight$Freq <- as.numeric(threads.weight$Freq)threads.weight$Response <- as.numeric(threads.weight$Response)threads.weight$Weight <- as.numeric(threads.weight$Weight) ###选择活跃线程threads.weight <- subset(threads.weight,is.na(threads.weight$Freq)==FALSE)

至此已经计算出了来源地址权重(from),发送者权重(sender),线程活跃度权重(threads)
接下来计算高频词项的权重,这里需要用到文本挖掘的工具,先算出词频

term.count <- function(term.vec,control){  vec.Corpus <- Corpus(VectorSource(term.vec))  vec.tdm <- TermDocumentMatrix(vec.Corpus,control = control)  return(rowSums(as.matrix(vec.tdm)))}thread.terms <- term.count(threads.weight$Thread,control =                              list(stopwords=stopwords()))thread.terms <- names(thread.terms)term.weights <- sapply(thread.terms, function(t) mean                      (threads.weight$Weight[grepl(t,threads.weight$Thread,                                                   fixed = TRUE)]))term.weights <- data.frame(list(Term = names(term.weights),Weight = term.weights),                           stringsAsFactors = F,row.names = 1:length(term.weights)) ####上面的代码已经求出了主题中词项的权重 #### 接下来求正文中词项的权重msg.term <- term.count(priority.train$Message,control =                          list(stopwords = stopwords(),                              removePunctuation=TRUE,                              removeNumbers=TRUE))msg.weights <- data.frame(list(Term = names(msg.term),                               Weight = log10(msg.term)),                          stringsAsFactors = F,row.names = 1:length(msg.term))msg.weights <- subset(msg.weights,Weight>0)

四、训练与测试排序算法

 为了给训练数据中的每一封邮件产生优先级,需要将前面计算的所有权重相乘。给出一封邮件,则我们需要找出这封邮件的特征,并根据特征找出已经计算好的该特征的权重,因此,需要一个查找某个特征的权重的函数。

get.weights <- function(search.term,weight.df,term = TRUE){  if(length(search.term)>0){    if(term == TRUE){      term.match <- match(search.term,weight.df$Term)    }    else{      term.match <- match(search.term,weight.df$Thread)    }    match.weight <- weight.df$Weight[which(!is.na(term.match))]    if(length(match.weight)<1){      return(1)      }    else{      return(mean(match.weight))      }  }  else{    return(1)    }} #这个函数根据传入的特征,到指定的权重表(weight.df)搜索对应的权重,如果没有则返回1,因为最终的权重是相乘的,返回1无法影响最终的权重值

把上面所有的过程封装成一个排序的函数,调用以上的函数完成排序。

rank.message <- function(path){  msg <- parse.email(path) #解析一封邮件,返回一个向量,包含from、date、message等  from <- ifelse(length(which(from.weight$From.Email==msg[2]))>0,                 from.weight$weights[which(from.weight$From.Email==msg[2])],1)  #如果这封邮件中的来源地址在来源地址权重表中,返回对应的权重,否则返回1  thread.from <- ifelse(length(which(senders.df$From.Email==msg[2]))>0,                        senders.df$Weight[which(senders.df$From.Email==msg[2])],1) #如果这封邮件的发送者在发送者权重表中,返回对应的权重,否则返回1  subj <- strsplit(tolower(msg[3]),"re: ")  is.thread <- ifelse(subj[[1]][1]=="",TRUE,FALSE)  if(is.thread){    activity <- get.weights(subj[[1]][2],threads.weight,term = FALSE)  }  else{activity <- 1} #如果该邮件在活跃的线程中,返回对应的权重,否则返回1    thread.term <- term.count(msg[3],control = list(stopwords=stopwords()))  thread.term.weight <- get.weights(thread.term,term.weights) #求主题词项的权重    msg.term <- term.count(msg[4],control = list(stopwords=stopwords(),                                               removePunctuation=TRUE,                                               removeNumbers=TRUE))  msg.weight <- get.weights(msg.term,msg.weights)  #求正文词项的权重  rank <- prod(from,thread.from,thread.term.weight,activity,msg.weight)#最终的权重等于各个权重的连乘积  return(c(msg[1],msg[2],msg[3],rank))}

 使用训练集进行训练。

train.paths <- priority$Path[1:(round(nrow(priority)/2))]test.paths <- priority$Path[((round(nrow(priority)/2))+1):nrow(priority)]train.ranks <- lapply(train.paths, rank.message)train.rank.matrix <- do.call(rbind,train.ranks)train.rank.matrix <- cbind(train.paths,train.rank.matrix,"TRAINING")train.rank.df <- data.frame(train.rank.matrix,stringsAsFactors = F)names(train.rank.df)<-c("Message","Date","From","Subj","Rank","Type")train.rank.df$Rank <- as.numeric(train.rank.df$Rank)priority.threshold <- median(train.rank.df$Rank)#权重阈值,以此判断是否推荐邮件train.rank.df$Priority <- ifelse(train.rank.df$Rank>priority.threshold,1,0)rank.plot <- ggplot(train.rank.df,aes(x = Rank))+  stat_density(aes(fill = "darkblue"))+  geom_vline(xintercept = priority.threshold, linetype = 2) +  scale_fill_manual(values = c("darkblue" = "darkblue"), guide = "none") +  theme_bw()rank.plotggsave("权重密度图.png",plot = rank.plot,width = 7,height = 4.7)

权重密度图

 从图中可以看出,中位数将大部分向右下角倾斜的数据判定为优先邮件,这是比较合理的

使用测试集进行测试

test.paths <-test.paths[grepl("E:/R/ML",test.paths)]#测试集路径中的某些文件名出错,因此进行一些选择test.ranks <- lapply(test.paths, rank.message)test.rank.matrix <- do.call(rbind,test.ranks)test.rank.matrix <- cbind(test.paths,test.rank.matrix,"TEST")test.rank.df <- data.frame(test.rank.matrix,stringsAsFactors = F)names(test.rank.df)<-c("Message","Date","From","Subj","Rank","Type")test.rank.df$Rank <- as.numeric(test.rank.df$Rank)test.rank.df$Priority <- ifelse(test.rank.df$Rank>priority.threshold,1,0)#根据训练出的阈值进行判断

最后将训练出的数据与测试的数据放在一起对比

final.df <- rbind(train.rank.df,test.rank.df)both.plot <- ggplot(subset(final.df,Type == "TRAINING"),aes(x = Rank))+  stat_density(aes(fill = Type,alpha = 0.65))+  stat_density(data = subset(final.df,Type == "TEST"),aes(fill = Type,alpha=0.65))+  geom_vline(xintercept = priority.threshold,linetype = 2)+  scale_alpha(guide = "none")+  scale_fill_manual(values = c("TRAINING"="darkblue","TEST"="darkred"))+  theme_bw()both.plotggsave("E:/R/ML/both_weight_density.png",plot = both.plot,width = 7,height = 4.7)

这里写图片描述

五、总结

这个例子的训练重点在以下几个方面:一是练习如何选取特征,二是如何利用R语言提取特征,特别是常用的那些函数,比如apply族函数、plyr包的数据整合函数以及字符处理函数、时间转换函数等等,三是练习数据可视化。总之,R语言的操作需要不断的练习,解决实际问题,才会提高。做这个例子最大的感受就是:函数很多,自编的、系统的等等,因此一定要有一个整体的概念,先调用什么函数,后调用什么函数,要做到心中有数,才不会乱。

0 0
原创粉丝点击