NLP: HashingTF and CountVectorizer Models in Spark
http://spark.apache.org/docs/latest/ml-features.html#tf-idf
```scala
import org.apache.spark.ml.feature._
import org.apache.spark.ml.linalg.SparseVector
import org.apache.spark.sql.SparkSession

import scala.collection.mutable

/**
 * Created by xubc on 2017/6/3.
 */
object TestX {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local[5]")
      .appName(this.getClass.getName.stripSuffix("$"))
      .getOrCreate()

    val sentenceData = spark.createDataFrame(Seq(
      (0.0, "Hi I heard about are Spark"),
      (1.0, "I wish Java could use case spark classes"),
      (2.0, "Logistic regression regression models are neat I")
    )).toDF("label", "sentence")

    val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")
    val wordsData = tokenizer.transform(sentenceData)

    // HashingTF bag-of-words model
    // val hashingTF = new HashingTF()
    //   .setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(100)
    // val featurizedData = hashingTF.transform(wordsData)

    // CountVectorizer bag-of-words model
    val cvModel: CountVectorizerModel = new CountVectorizer()
      .setInputCol("words").setOutputCol("rawFeatures")
      .fit(wordsData)
    val featurizedData = cvModel.transform(wordsData)

    val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
    val idfModel = idf.fit(featurizedData)
    val rescaledData = idfModel.transform(featurizedData)
    rescaledData.printSchema()

    val vocabulary = cvModel.vocabulary
    println(vocabulary.mkString(","))
    rescaledData.show(false)

    rescaledData.foreach(e => {
      val label = e.getAs[Double]("label")
      val str = e.getAs[String]("sentence")
      val words = e.getAs[mutable.WrappedArray[String]]("words").mkString(",")
      val tf = e.getAs[SparseVector]("rawFeatures")
      val originWords = tf.indices.map(i => vocabulary(i)).mkString(",")
      val tfidf = e.getAs[SparseVector]("features") // renamed from `idf` to avoid shadowing the IDF estimator above
      println(
        s"""$label $str
           | $words
           | $tf $originWords
           | $tfidf""".stripMargin)
    })
  }
}
```

With CountVectorizer, the model's `vocabulary` lets you map feature indices back to the words that carry high TF-IDF weight. HashingTF computes term frequencies more efficiently by hashing each term to an index, but the hash is one-way: feature indices cannot be traced back to concrete words.
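The one-way nature of the hashed mapping can be sketched without Spark at all. Note the caveat: Spark's HashingTF hashes each term's UTF-8 bytes with MurmurHash3 (seed 42), while the stdlib `stringHash` below hashes the `String` directly, so the exact bucket indices here are illustrative and may not match Spark's.

```scala
import scala.util.hashing.MurmurHash3

object HashingTFSketch {
  // Map a term to a bucket in [0, numFeatures), in the spirit of HashingTF.
  // Seed 42 mirrors Spark's choice; stringHash is a stand-in for Spark's
  // UTF-8-byte MurmurHash3, so indices may differ from Spark's.
  def indexOf(term: String, numFeatures: Int): Int = {
    val h = MurmurHash3.stringHash(term, 42)
    ((h % numFeatures) + numFeatures) % numFeatures // non-negative modulo
  }

  // Counts per bucket. The term itself is discarded after hashing, so there
  // is no vocabulary array to invert: two distinct terms that collide in the
  // same bucket are indistinguishable in the resulting vector.
  def termFrequencies(words: Seq[String], numFeatures: Int): Map[Int, Double] =
    words.groupBy(indexOf(_, numFeatures)).map { case (i, ws) => i -> ws.size.toDouble }

  def main(args: Array[String]): Unit = {
    val tf = termFrequencies(Seq("i", "wish", "java", "could", "use", "case", "spark", "classes"), 16)
    println(tf) // bucket index -> count; the original words are gone
  }
}
```

This is exactly the trade-off in the article: hashing avoids building and broadcasting a vocabulary, at the cost of invertibility (and occasional collisions when `numFeatures` is small).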
Sample output of the `foreach` above (record order varies across partitions):

```text
1.0 I wish Java could use case spark classes
 i,wish,java,could,use,case,spark,classes
 (16,[0,2,4,5,7,8,13,14],[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0]) i,spark,could,java,wish,case,classes,use
 (16,[0,2,4,5,7,8,13,14],[0.0,0.28768207245178085,0.6931471805599453,0.6931471805599453,0.6931471805599453,0.6931471805599453,0.6931471805599453,0.6931471805599453])
2.0 Logistic regression regression models are neat I
 logistic,regression,regression,models,are,neat,i
 (16,[0,1,3,6,9,15],[1.0,2.0,1.0,1.0,1.0,1.0]) i,regression,are,neat,models,logistic
 (16,[0,1,3,6,9,15],[0.0,1.3862943611198906,0.28768207245178085,0.6931471805599453,0.6931471805599453,0.6931471805599453])
0.0 Hi I heard about are Spark
 hi,i,heard,about,are,spark
 (16,[0,2,3,10,11,12],[1.0,1.0,1.0,1.0,1.0,1.0]) i,spark,are,about,hi,heard
 (16,[0,2,3,10,11,12],[0.0,0.28768207245178085,0.28768207245178085,0.6931471805599453,0.6931471805599453,0.6931471805599453])
```
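The weights in the output can be checked by hand: Spark ML's IDF uses idf(t) = ln((m + 1) / (df(t) + 1)) over m documents, and the transform multiplies each term's raw frequency by its IDF, which is why the doubled "regression" gets 2 × ln 2 ≈ 1.3863.

```scala
object IdfCheck {
  // Spark ML's smoothed IDF: idf(t) = ln((m + 1) / (df(t) + 1)),
  // where m is the number of documents and df(t) the document frequency of t.
  def idf(numDocs: Long, docFreq: Long): Double =
    math.log((numDocs + 1.0) / (docFreq + 1.0))

  def main(args: Array[String]): Unit = {
    val m = 3 // the three sentences in the example corpus
    println(idf(m, 3)) // "i" appears in all 3 docs -> ln(4/4) = 0.0
    println(idf(m, 2)) // "spark", "are" appear in 2 docs -> ln(4/3) ≈ 0.2877
    println(idf(m, 1)) // words unique to one doc -> ln(4/2) = ln 2 ≈ 0.6931
  }
}
```

These values match the feature vectors above, including the 0.0 weight for "i": a term present in every document contributes nothing under this scheme.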