Spark Growth Path (8): TF-IDF
Source: Internet. Editor: 程序博客网. Date: 2024/06/06 23:19
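TF-IDF
- Introduction
- Source code
- Output
Introduction
TF-IDF (term frequency-inverse document frequency) is a text feature extraction algorithm. It is especially useful when assigning an article to a category.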
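To make the weighting concrete, here is a minimal sketch in plain Scala that computes TF-IDF by hand, without Spark. It assumes the smoothed IDF formula given in the Spark MLlib documentation, log((m + 1) / (df + 1)), where m is the number of documents and df is the number of documents containing the term. The object name `TfIdfByHand` and its helper functions are hypothetical names chosen for illustration, and the tokenization only approximates what Spark's `Tokenizer` does.

```scala
object TfIdfByHand {
  // Approximate Spark's Tokenizer: lowercase, then split on whitespace.
  def tokenize(sentence: String): Seq[String] =
    sentence.toLowerCase.split("\\s+").toSeq

  // Raw term frequency: how many times each term occurs in one document.
  def termFrequencies(words: Seq[String]): Map[String, Int] =
    words.groupBy(identity).map { case (term, hits) => term -> hits.size }

  // Smoothed IDF (assumed to match Spark MLlib): log((m + 1) / (df + 1)).
  def inverseDocumentFrequencies(docs: Seq[Seq[String]]): Map[String, Double] = {
    val m = docs.size
    docs.flatMap(_.distinct)
      .groupBy(identity)
      .map { case (term, hits) => term -> math.log((m + 1.0) / (hits.size + 1.0)) }
  }

  // TF-IDF weight of every term in every document.
  def tfIdf(docs: Seq[Seq[String]]): Seq[Map[String, Double]] = {
    val idf = inverseDocumentFrequencies(docs)
    docs.map(doc => termFrequencies(doc).map { case (term, tf) => term -> tf * idf(term) })
  }

  def main(args: Array[String]): Unit = {
    val docs = Seq(
      "Hi I heard about Spark",
      "I wish Java could use case classes",
      "Logistic regression models are neat"
    ).map(tokenize)
    // Print one term -> weight map per sentence.
    tfIdf(docs).foreach(println)
  }
}
```

Under this formula a term that appears in every document gets a weight of zero, while a term that appears in only one document gets the largest weight, which is why the features help separate categories.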
Source code
import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}
import org.apache.spark.sql.SparkSession

object TfIdfExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().getOrCreate()
    spark.sparkContext.setLogLevel("WARN")

    val sentenceData = spark.createDataFrame(Seq(
      (0.0, "Hi I heard about Spark"),
      (0.0, "I wish Java could use case classes"),
      (1.0, "Logistic regression models are neat")
    )).toDF("label", "sentence")

    // Split each sentence into words
    val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")
    val wordsData = tokenizer.transform(sentenceData)
    wordsData.show()

    // Convert each sentence's words into a term-frequency feature vector
    val hashingTF = new HashingTF()
      .setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(200)
    val featurizedData = hashingTF.transform(wordsData)
    featurizedData.show()
    // alternatively, CountVectorizer can also be used to get term frequency vectors

    // Fit the IDF model on the term frequencies and rescale them into TF-IDF features
    val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
    val idfModel = idf.fit(featurizedData)
    val rescaledData = idfModel.transform(featurizedData)
    rescaledData.select("features", "label").show()
    rescaledData.show()
  }
}
Output
+-----+--------------------+--------------------+
|label|            sentence|               words|
+-----+--------------------+--------------------+
|  0.0|Hi I heard about ...|[hi, i, heard, ab...|
|  0.0|I wish Java could...|[i, wish, java, c...|
|  1.0|Logistic regressi...|[logistic, regres...|
+-----+--------------------+--------------------+

+-----+--------------------+--------------------+--------------------+
|label|            sentence|               words|         rawFeatures|
+-----+--------------------+--------------------+--------------------+
|  0.0|Hi I heard about ...|[hi, i, heard, ab...|(200,[105,129,157...|
|  0.0|I wish Java could...|[i, wish, java, c...|(200,[9,13,89,95,...|
|  1.0|Logistic regressi...|[logistic, regres...|(200,[4,86,95,138...|
+-----+--------------------+--------------------+--------------------+

+--------------------+-----+
|            features|label|
+--------------------+-----+
|(200,[105,129,157...|  0.0|
|(200,[9,13,89,95,...|  0.0|
|(200,[4,86,95,138...|  1.0|
+--------------------+-----+

+-----+--------------------+--------------------+--------------------+--------------------+
|label|            sentence|               words|         rawFeatures|            features|
+-----+--------------------+--------------------+--------------------+--------------------+
|  0.0|Hi I heard about ...|[hi, i, heard, ab...|(200,[105,129,157...|(200,[105,129,157...|
|  0.0|I wish Java could...|[i, wish, java, c...|(200,[9,13,89,95,...|(200,[9,13,89,95,...|
|  1.0|Logistic regressi...|[logistic, regres...|(200,[4,86,95,138...|(200,[4,86,95,138...|
+-----+--------------------+--------------------+--------------------+--------------------+
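In the rawFeatures column above, index 95 appears in both the second and third rows even though those two sentences share no words, which suggests a hash collision with only 200 buckets. The comment in the source code mentions CountVectorizer as an alternative for producing term-frequency vectors. The following is a minimal sketch of that variant, not the author's code: the object name `TfIdfCountVectorizerSketch`, the `run` helper, and the `local[*]` master are assumptions made so the example can run on its own.

```scala
import org.apache.spark.ml.feature.{CountVectorizer, IDF, Tokenizer}
import org.apache.spark.sql.{DataFrame, SparkSession}

object TfIdfCountVectorizerSketch {
  // Returns the learned vocabulary and the DataFrame with TF-IDF features.
  def run(spark: SparkSession): (Array[String], DataFrame) = {
    val sentenceData = spark.createDataFrame(Seq(
      (0.0, "Hi I heard about Spark"),
      (0.0, "I wish Java could use case classes"),
      (1.0, "Logistic regression models are neat")
    )).toDF("label", "sentence")

    // Split each sentence into lowercase words.
    val wordsData = new Tokenizer()
      .setInputCol("sentence").setOutputCol("words")
      .transform(sentenceData)

    // CountVectorizer learns an explicit vocabulary, so each term gets its own index.
    val cvModel = new CountVectorizer()
      .setInputCol("words").setOutputCol("rawFeatures")
      .fit(wordsData)
    val countedData = cvModel.transform(wordsData)

    // Rescale the raw counts with IDF, as in the HashingTF version.
    val idfModel = new IDF()
      .setInputCol("rawFeatures").setOutputCol("features")
      .fit(countedData)
    (cvModel.vocabulary, idfModel.transform(countedData))
  }

  def main(args: Array[String]): Unit = {
    // master("local[*]") is an assumption so the sketch can run without a cluster.
    val spark = SparkSession.builder()
      .master("local[*]").appName("TfIdfCountVectorizerSketch").getOrCreate()
    spark.sparkContext.setLogLevel("WARN")
    val (vocabulary, rescaledData) = run(spark)
    println(vocabulary.mkString(", "))
    rescaledData.select("label", "features").show(false)
    spark.stop()
  }
}
```

The trade-off is that CountVectorizer needs an extra pass over the data to build its vocabulary, while HashingTF needs no fitting step but can map different words to the same index.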