The Road to Spark (10): CountVectorizer

Source: Internet | Editor: 程序博客网 | Time: 2024/06/14 13:52

CountVectorizer

Introduction

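CountVectorizer turns each document (an array of words) into a vector whose entries are the number of times each vocabulary word occurs in that document.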
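To make the idea concrete, here is a minimal sketch in plain Scala with no Spark dependency. It is not Spark's implementation: the names `TermCountSketch`, `buildVocabulary`, and `countVector` are hypothetical, and the alphabetical tie-break is an assumption added only to keep the result deterministic. It keeps words that occur in at least `minDF` documents, orders them by total frequency, truncates to `vocabSize`, and then counts each document against that vocabulary.

```scala
object TermCountSketch {
  // Hypothetical helper: keep terms found in at least minDF documents,
  // order them by total corpus frequency (descending), keep the top vocabSize.
  def buildVocabulary(docs: Seq[Seq[String]], vocabSize: Int, minDF: Int): Vector[String] = {
    // Number of documents each term appears in.
    val docFreq = docs.flatMap(_.distinct).groupBy(identity).map { case (t, xs) => t -> xs.size }
    // Total number of occurrences of each term across all documents.
    val termFreq = docs.flatten.groupBy(identity).map { case (t, xs) => t -> xs.size }
    termFreq.toVector
      .filter { case (t, _) => docFreq(t) >= minDF }
      .sortBy { case (t, n) => (-n, t) } // assumption: ties broken alphabetically
      .take(vocabSize)
      .map(_._1)
  }

  // Count how often each vocabulary term occurs in one document.
  def countVector(doc: Seq[String], vocab: Vector[String]): Vector[Double] = {
    val counts = doc.groupBy(identity).map { case (t, xs) => t -> xs.size.toDouble }
    vocab.map(t => counts.getOrElse(t, 0.0))
  }

  def main(args: Array[String]): Unit = {
    val docs = Seq(Seq("a", "b", "c"), Seq("a", "b", "b", "c", "a", "a"))
    val vocab = buildVocabulary(docs, vocabSize = 3, minDF = 2)
    docs.foreach(d => println(countVector(d, vocab).mkString("[", ", ", "]")))
  }
}
```

For the two sample documents used in this post, the sketch yields the vocabulary a, b, c and the count vectors [1.0, 1.0, 1.0] and [3.0, 2.0, 1.0].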

Code

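import org.apache.spark.ml.feature.{CountVectorizer, CountVectorizerModel}
import org.apache.spark.sql.SparkSession

object CountVectorizerExample {
  def main(args: Array[String]): Unit = {
    // the master URL is expected to come from spark-submit
    val spark = SparkSession.builder().appName("CountVectorizerExample").getOrCreate()

    val df = spark.createDataFrame(Seq(
      (0, Array("a", "b", "c")),
      (1, Array("a", "b", "b", "c", "a", "a"))
    )).toDF("id", "words")

    // fit a CountVectorizerModel from the corpus:
    // keep at most 3 terms, each appearing in at least 2 documents
    val cvModel: CountVectorizerModel = new CountVectorizer()
      .setInputCol("words")
      .setOutputCol("features")
      .setVocabSize(3)
      .setMinDF(2)
      .fit(df)

    // alternatively, define CountVectorizerModel with a-priori vocabulary
    // (each term listed once; the original listed "c" twice)
    val cvm = new CountVectorizerModel(Array("a", "b", "c"))
      .setInputCol("words")
      .setOutputCol("features")

    cvModel.transform(df).show(false)

    spark.stop()
  }
}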

Output

+---+------------------+-------------------------+
|id |words             |features                 |
+---+------------------+-------------------------+
|0  |[a, b, c]         |(3,[0,1,2],[1.0,1.0,1.0])|
|1  |[a, b, b, c, a, a]|(3,[0,1,2],[3.0,2.0,1.0])|
+---+------------------+-------------------------+
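Each features value is printed in sparse form: (vector size, indices of the non-zero entries, values at those indices). The sketch below, in plain Scala without Spark, expands such a triple into a dense array and pairs it with the vocabulary. The names `SparseVectorSketch` and `toDense` are hypothetical, and the vocabulary order a, b, c is an assumption that matches the term frequencies in this example.

```scala
object SparseVectorSketch {
  // Expand (size, indices, values) into a dense array; positions not listed stay 0.0.
  def toDense(size: Int, indices: Array[Int], values: Array[Double]): Array[Double] = {
    require(indices.length == values.length, "indices and values must have equal length")
    val dense = Array.fill(size)(0.0)
    indices.zip(values).foreach { case (i, v) => dense(i) = v }
    dense
  }

  def main(args: Array[String]): Unit = {
    // Assumed vocabulary order for this example (most frequent term first).
    val vocabulary = Array("a", "b", "c")
    // The features value of the row with id 1: (3,[0,1,2],[3.0,2.0,1.0])
    val dense = toDense(3, Array(0, 1, 2), Array(3.0, 2.0, 1.0))
    vocabulary.zip(dense).foreach { case (term, n) => println(s"$term -> $n") }
  }
}
```

Read this way, the row with id 1 contains "a" three times, "b" twice, and "c" once, which agrees with its words column.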