Spark MLlib Feature Processing: StringIndexer (String Indexing), Principle and Hands-On Example

Source: Internet; Editor: 程序博客网; Time: 2024/04/30 10:08

Principle

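1) Sort the distinct strings by number of occurrences, in descending order; strings with the same count are ordered by their natural (lexicographic) order.

2) Following that order, map each string to a Double index, starting from 0.0 by default, so the most frequent string gets 0.0.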
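The two steps above can be sketched in plain Scala, without Spark. This is only an illustration of the ordering rule as described here, not Spark's actual implementation; `FrequencyIndexSketch` and `buildIndex` are hypothetical names introduced for this example.

```scala
object FrequencyIndexSketch {
  // Hypothetical helper (not part of Spark): maps each label to a Double index.
  def buildIndex(labels: Seq[String]): Map[String, Double] = {
    // Step 1a: count how many times each distinct label occurs.
    val counts: Map[String, Int] =
      labels.groupBy(identity).map { case (label, occurrences) => label -> occurrences.size }

    // Step 1b: sort by count descending; equal counts fall back to natural string order.
    val ordered: Seq[String] =
      counts.toSeq.sortBy { case (label, count) => (-count, label) }.map(_._1)

    // Step 2: assign indices 0.0, 1.0, 2.0, ... in that order.
    ordered.zipWithIndex.map { case (label, i) => label -> i.toDouble }.toMap
  }

  def main(args: Array[String]): Unit = {
    val index = buildIndex(Seq("a", "b", "c", "a", "a", "c", "d"))
    // Print the mapping in index order.
    index.toSeq.sortBy(_._2).foreach { case (label, i) => println(s"$label => $i") }
  }
}
```

With the same seven category values used in the Spark example in this article, the sketch yields a => 0.0, c => 1.0, b => 2.0, d => 3.0.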

Hands-On Code

import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.{SparkConf, SparkContext}

object StringToIndexExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StringIndexerExample").setMaster("local[6]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // Build a DataFrame from a Seq of (id, category) tuples
    val df: DataFrame = sqlContext.createDataFrame(
      Seq((0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c"), (6, "d"))
    ).toDF("id", "category")

    // Set the input column and the output column
    val indexer = new StringIndexer()
      .setInputCol("category")
      .setOutputCol("categoryIndex")

    // fit: orders the labels by occurrence count, descending;
    //      equal counts fall back to natural order.
    //      Here a occurs 3 times, c 2 times, b once, d once.
    // transform: replaces each string with its Double index in that order,
    //      i.e. a => 0.0, c => 1.0, b => 2.0, d => 3.0
    val indexed = indexer.fit(df).transform(df)
    indexed.show()

    sc.stop()
  }
}

// Output
// +---+--------+-------------+
// | id|category|categoryIndex|
// +---+--------+-------------+
// |  0|       a|          0.0|
// |  1|       b|          2.0|
// |  2|       c|          1.0|
// |  3|       a|          0.0|
// |  4|       a|          0.0|
// |  5|       c|          1.0|
// |  6|       d|          3.0|
// +---+--------+-------------+

0 0