Extracting TF-IDF Features from the Newsgroups Dataset


To practice feature extraction, I will use a very well-known dataset called 20 Newsgroups, which is commonly used for text classification.

1. Inspecting the data

First, look at the directory structure and the layout of the data.

import org.apache.spark.SparkContext

val sc = new SparkContext("local[2]", "TF-IDF")
val path = "data/20news-bydate-train/*"
// wholeTextFiles yields one (filePath, fileContent) pair per document
val rdd = sc.wholeTextFiles(path)
val text = rdd.map { case (file, text) => text }
println(text.count())

2. Applying basic tokenization

Split the raw content of each document into a collection of words; we start with simple whitespace tokenization (see the sketch after the next snippet). Even for a relatively small text corpus, the number of distinct words (that is, the dimensionality of the feature vector) can be very high. First, count how many documents each newsgroup contains:

val newsgroups = rdd.map { case (file, text) => file.split("/").takeRight(2).head }
val countByGroup = newsgroups.map(n => (n, 1)).reduceByKey(_ + _).collect.sortBy(-_._2).mkString("\n")
println(countByGroup)
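The snippet above counts documents per newsgroup; the whitespace tokenization itself is not shown in the post. A minimal sketch, where the name whiteSpaceSplit is an assumption:

// naive tokenization: split on single spaces and lowercase everything
val whiteSpaceSplit = text.flatMap(t => t.split(" ").map(_.toLowerCase))
// the number of distinct tokens is the dimensionality of the raw feature space
println(whiteSpaceSplit.distinct.count)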

3. Improving tokenization

The processing above produces many tokens that are not words (punctuation, digits). Use a regular expression to remove these characters.

val nonWordSplit = text.flatMap(t => t.split("""\W+""").map(_.toLowerCase))
// keep only tokens that contain no digits
val regex = """[^0-9]*""".r
val filterNumbers = nonWordSplit.filter(token => regex.pattern.matcher(token).matches())

4. Removing stop words

val tokenCounts = filterNumbers.map(t => (t, 1)).reduceByKey(_ + _)
val orderingDesc = Ordering.by[(String, Int), Int](_._2)
println(tokenCounts.top(20)(orderingDesc).mkString("\n"))
val stopWords = Set("the", "a", "an", "of", "in", "or", "for", "by", "on", "but", "is", "not", "with",
  "as", "was", "if", "they", "are", "this", "that", "and", "it", "have", "from", "at", "my", "be", "to")
val tokenCountsFilteredStopWords = tokenCounts.filter { case (k, v) => !stopWords.contains(k) }
println(tokenCountsFilteredStopWords.top(20)(orderingDesc).mkString("\n"))
// also drop single-character tokens
val tokenCountsFilteredSize = tokenCountsFilteredStopWords.filter { case (k, v) => k.size >= 2 }
println(tokenCountsFilteredSize.top(20)(orderingDesc).mkString("\n"))

5. Removing words by frequency

Words that occur very rarely should also be removed: with too little training data behind them, they add no value to the model.

// tokenCountsFilteredSize comes from step 4; now drop tokens that appear only once
val rareTokens = tokenCounts.filter { case (k, v) => v < 2 }.map { case (k, v) => k }.collect.toSet
val tokenCountsFilteredAll = tokenCountsFilteredSize.filter { case (k, v) => !rareTokens.contains(k) }
println(tokenCountsFilteredAll.top(20)(orderingDesc).mkString("\n"))

6. Stemming

Use NLP methods or libraries such as NLTK, OpenNLP, and Lucene.
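The post gives no code for this step. A purely illustrative sketch of suffix stripping follows; this naive helper is an assumption, not a real stemmer, for which a Porter implementation from NLTK, OpenNLP, or Lucene would be used in practice:

// naiveStem is a hypothetical helper: it strips a few common English suffixes
// and is only meant to illustrate what a stemmer does to the token stream.
def naiveStem(token: String): String = {
  val suffixes = Seq("ing", "ed", "es", "s")
  suffixes.find(s => token.endsWith(s) && token.length > s.length + 2) match {
    case Some(suffix) => token.dropRight(suffix.length)
    case None         => token
  }
}

val stemmed = filterNumbers.map(naiveStem)  // e.g. "walked" -> "walk"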

7. Training the TF-IDF model
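The code below calls a tokenize helper that the post never defines. A minimal sketch, assuming it combines the regex split from step 3 with the stop-word, length, and rarity filters from steps 4 and 5 (stopWords and rareTokens as built above):

def tokenize(line: String): Seq[String] = {
  line.split("""\W+""")
    .map(_.toLowerCase)
    .filter(token => """[^0-9]*""".r.pattern.matcher(token).matches()) // drop tokens containing digits
    .filterNot(token => stopWords.contains(token))                     // drop stop words
    .filterNot(token => rareTokens.contains(token))                    // drop rare tokens
    .filter(token => token.size >= 2)                                  // drop single characters
    .toSeq
}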

import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.linalg.SparseVector

val tokens = text.map(doc => tokenize(doc))
// hash each token into a fixed feature space of 2^18 dimensions
val dim = math.pow(2, 18).toInt
val hashingTF = new HashingTF(dim)
val tf = hashingTF.transform(tokens)
tf.cache()
val v = tf.first.asInstanceOf[SparseVector]
println(v.size)

8. Analyzing TF-IDF weights
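No code accompanies this heading. A minimal sketch of the step it names, assuming MLlib's IDF is fit on the tf vectors from step 7:

import org.apache.spark.mllib.feature.IDF

val idf = new IDF().fit(tf)    // learn inverse document frequencies from the corpus
val tfidf = idf.transform(tf)  // rescale each term frequency by its IDF weight
val v2 = tfidf.first.asInstanceOf[SparseVector]
println(v2.values.size)            // number of non-zero weights in the first document
println(v2.values.take(10).toSeq)  // a sample of the TF-IDF weights
println(v2.indices.take(10).toSeq) // the corresponding hashed term indices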