Spark Learning Journey (8): TF-IDF


TF-IDF

  • Introduction
  • Source code
  • Output

Introduction

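TF-IDF (term frequency-inverse document frequency) is a text feature extraction algorithm. It is particularly useful when assigning an article to a category.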
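To make the weighting concrete, here is a minimal sketch in plain Scala (no Spark needed) that computes TF-IDF by hand. It assumes the smoothed formula given in the Spark MLlib documentation, IDF(t) = log((N + 1) / (df(t) + 1)), where N is the number of documents and df(t) is the number of documents containing term t. The object and method names (`TfIdfSketch`, `termFrequencies`, `inverseDocumentFrequencies`, `tfIdf`) are made up for illustration and are not part of any library.

```scala
object TfIdfSketch {
  // Term frequency: how many times each term occurs in a single document.
  def termFrequencies(doc: Seq[String]): Map[String, Int] =
    doc.groupBy(identity).map { case (term, occurrences) => term -> occurrences.size }

  // Smoothed inverse document frequency: log((N + 1) / (df + 1)).
  // Assumption: this mirrors the formula in the Spark MLlib documentation.
  def inverseDocumentFrequencies(docs: Seq[Seq[String]]): Map[String, Double] = {
    val n = docs.size
    docs.flatMap(_.distinct) // count each term at most once per document
      .groupBy(identity)
      .map { case (term, hits) => term -> math.log((n + 1.0) / (hits.size + 1.0)) }
  }

  // TF-IDF weight of every term in every document: tf * idf.
  def tfIdf(docs: Seq[Seq[String]]): Seq[Map[String, Double]] = {
    val idf = inverseDocumentFrequencies(docs)
    docs.map(doc => termFrequencies(doc).map { case (term, tf) => term -> tf * idf(term) })
  }

  def main(args: Array[String]): Unit = {
    // The same three sentences as in the Spark example, lower-cased and split on spaces.
    val docs = Seq(
      "hi i heard about spark",
      "i wish java could use case classes",
      "logistic regression models are neat"
    ).map(_.split(" ").toSeq)
    tfIdf(docs).foreach(println)
  }
}
```

In this toy corpus the word "i" occurs in two of the three sentences, so its IDF is log(4/3), about 0.288, while "spark" occurs in only one and gets log(4/2), about 0.693. Rare terms therefore weigh more than common ones, which is what makes the resulting vectors useful for classification.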

Source code

import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}
import org.apache.spark.sql.SparkSession

object TfIdfExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().getOrCreate()
    spark.sparkContext.setLogLevel("WARN")

    val sentenceData = spark.createDataFrame(Seq(
      (0.0, "Hi I heard about Spark"),
      (0.0, "I wish Java could use case classes"),
      (1.0, "Logistic regression models are neat")
    )).toDF("label", "sentence")

    // Split each sentence into lower-cased words
    val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")
    val wordsData = tokenizer.transform(sentenceData)
    wordsData.show()

    // Hash the words of each sentence into a term-frequency vector with 200 buckets
    val hashingTF = new HashingTF()
      .setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(200)
    val featurizedData = hashingTF.transform(wordsData)
    featurizedData.show()

    // alternatively, CountVectorizer can also be used to get term frequency vectors
    // Fit IDF on the corpus, then rescale the raw term frequencies
    val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
    val idfModel = idf.fit(featurizedData)
    val rescaledData = idfModel.transform(featurizedData)
    rescaledData.select("features", "label").show()
    rescaledData.show()
  }
}
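The `setNumFeatures(200)` call above fixes the length of the term-frequency vector: each word is hashed to one of 200 buckets rather than looked up in a vocabulary. The sketch below illustrates that idea in plain Scala. It is only an approximation of the technique: it uses `scala.util.hashing.MurmurHash3.stringHash` as a stand-in hash, so the bucket indices it produces will not match the ones Spark's `HashingTF` prints, and the names `HashingTrickSketch`, `bucket`, and `hashingTF` are hypothetical.

```scala
import scala.util.hashing.MurmurHash3

object HashingTrickSketch {
  // Map a term to a bucket in [0, numFeatures).
  // Assumption: MurmurHash3.stringHash is a stand-in, not the hash Spark uses.
  def bucket(term: String, numFeatures: Int): Int = {
    val raw = MurmurHash3.stringHash(term) % numFeatures
    if (raw < 0) raw + numFeatures else raw // keep the index non-negative
  }

  // Sparse term-frequency vector as (bucket index, count) pairs, sorted by index.
  // Words that collide in the same bucket have their counts added together.
  def hashingTF(words: Seq[String], numFeatures: Int): Seq[(Int, Double)] =
    words.groupBy(bucket(_, numFeatures))
      .map { case (index, terms) => index -> terms.size.toDouble }
      .toSeq
      .sortBy(_._1)

  def main(args: Array[String]): Unit = {
    val words = "hi i heard about spark".split(" ").toSeq
    println(hashingTF(words, 200))
  }
}
```

With a small bucket count, distinct words can land on the same index and become indistinguishable. A larger `numFeatures` reduces such collisions at the cost of longer (though still sparse) vectors.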

Output

+-----+--------------------+--------------------+
|label|            sentence|               words|
+-----+--------------------+--------------------+
|  0.0|Hi I heard about ...|[hi, i, heard, ab...|
|  0.0|I wish Java could...|[i, wish, java, c...|
|  1.0|Logistic regressi...|[logistic, regres...|
+-----+--------------------+--------------------+

+-----+--------------------+--------------------+--------------------+
|label|            sentence|               words|         rawFeatures|
+-----+--------------------+--------------------+--------------------+
|  0.0|Hi I heard about ...|[hi, i, heard, ab...|(200,[105,129,157...|
|  0.0|I wish Java could...|[i, wish, java, c...|(200,[9,13,89,95,...|
|  1.0|Logistic regressi...|[logistic, regres...|(200,[4,86,95,138...|
+-----+--------------------+--------------------+--------------------+

+--------------------+-----+
|            features|label|
+--------------------+-----+
|(200,[105,129,157...|  0.0|
|(200,[9,13,89,95,...|  0.0|
|(200,[4,86,95,138...|  1.0|
+--------------------+-----+

+-----+--------------------+--------------------+--------------------+--------------------+
|label|            sentence|               words|         rawFeatures|            features|
+-----+--------------------+--------------------+--------------------+--------------------+
|  0.0|Hi I heard about ...|[hi, i, heard, ab...|(200,[105,129,157...|(200,[105,129,157...|
|  0.0|I wish Java could...|[i, wish, java, c...|(200,[9,13,89,95,...|(200,[9,13,89,95,...|
|  1.0|Logistic regressi...|[logistic, regres...|(200,[4,86,95,138...|(200,[4,86,95,138...|
+-----+--------------------+--------------------+--------------------+--------------------+