Feature Transformation 05: Label Index Conversion and Feature Assembly


Notes compiled: January 20, 2017
Compiled by: 王小草

1.StringIndexer

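Converts a categorical label column into numeric indices. Labels are ordered by descending frequency, so the most frequent label gets index 0, the next one index 1, and so on.

If the input column is numeric, the values are first cast to strings and then indexed in the same way.

For example, take the category column (second column) below:

 id | category
----|----------
 0  | a
 1  | b
 2  | c
 3  | a
 4  | a
 5  | c

There are three categories: a, b, and c. a is the most frequent, so it gets index 0, followed by c (index 1) and b (index 2). The resulting indices are shown in the third column:

 id | category | categoryIndex
----|----------|---------------
 0  | a        | 0.0
 1  | b        | 2.0
 2  | c        | 1.0
 3  | a        | 0.0
 4  | a        | 0.0
 5  | c        | 1.0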
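The frequency rule can be checked by hand with a few lines of plain Scala. This is only a sketch of the ordering rule, not Spark's implementation; the object name `FrequencyIndexSketch`, the function `buildIndex`, and the alphabetical tie-break are assumptions made for illustration.

```scala
object FrequencyIndexSketch {
  // Map each label to an index: the most frequent label gets 0.0, the next 1.0, ...
  // Ties are broken alphabetically here; that tie-break is an assumption of this sketch.
  def buildIndex(labels: Seq[String]): Map[String, Double] = {
    val counts = labels.groupBy(identity).map { case (label, xs) => (label, xs.size) }
    counts.toSeq
      .sortBy { case (label, count) => (-count, label) }
      .map(_._1)
      .zipWithIndex
      .map { case (label, i) => (label, i.toDouble) }
      .toMap
  }

  def main(args: Array[String]): Unit = {
    val categories = Seq("a", "b", "c", "a", "a", "c")
    val index = buildIndex(categories)
    // Print id, category, and categoryIndex for each row.
    categories.zipWithIndex.foreach { case (c, id) =>
      println(s"$id $c ${index(c)}")
    }
  }
}
```

For the six rows above, `buildIndex` maps a to 0.0, c to 1.0, and b to 2.0, matching the table.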

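What if the training set contains three categories, but a fourth category d shows up in the test set or in new data? There are two strategies to choose from:
First, throw an exception (this is the default).
Second, skip the entire row that contains the unseen category.

With the second strategy, suppose the following new data arrives:

 id | category
----|----------
 0  | a
 1  | b
 2  | c
 3  | d

Because d did not appear when the index was built, its row is dropped automatically:

 id | category | categoryIndex
----|----------|---------------
 0  | a        | 0.0
 1  | b        | 2.0
 2  | c        | 1.0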
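In Spark the strategy is chosen with `setHandleInvalid` on the StringIndexer ("error" is the default, "skip" drops the row). The difference between the two can be sketched in plain Scala as follows; the object `UnseenLabelSketch` and its two functions are hypothetical names used only for illustration, and the fitted index is hard-coded from the example.

```scala
object UnseenLabelSketch {
  // Index learned from the training data in the example: a -> 0, c -> 1, b -> 2.
  val fitted: Map[String, Double] = Map("a" -> 0.0, "c" -> 1.0, "b" -> 2.0)

  // "error" strategy: fail on the first label that was not seen during fitting.
  def transformOrFail(rows: Seq[(Int, String)]): Seq[(Int, String, Double)] =
    rows.map { case (id, label) =>
      val idx = fitted.getOrElse(label, throw new NoSuchElementException(s"Unseen label: $label"))
      (id, label, idx)
    }

  // "skip" strategy: drop every row whose label was not seen during fitting.
  def transformSkipping(rows: Seq[(Int, String)]): Seq[(Int, String, Double)] =
    rows.flatMap { case (id, label) =>
      fitted.get(label).map(idx => (id, label, idx))
    }
}
```

Calling `transformSkipping` on the four new rows keeps ids 0 to 2 and drops the row with d, while `transformOrFail` throws on that row.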

Code:

    import org.apache.spark.ml.feature.StringIndexer

    // Assumes an existing SparkSession named `spark`.
    val df = spark.createDataFrame(
      Seq((0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c"))
    ).toDF("id", "category")

    val indexer = new StringIndexer()
      .setInputCol("category")
      .setOutputCol("categoryIndex")

    // fit() learns the label-to-index mapping; transform() applies it.
    val indexed = indexer.fit(df).transform(df)
    indexed.show()

2.IndexToString

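The exact opposite of StringIndexer: it converts label indices back into label strings.
Typically, string category labels are first converted to indices with StringIndexer to simplify later computation, and the indices in the final predictions or output are then converted back to the original string labels.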
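The round trip amounts to a lookup in an ordered list of labels. A minimal plain-Scala sketch, assuming the label order a, c, b from the StringIndexer example (the object `IndexToStringSketch` and its functions are illustrative names, not Spark API):

```scala
object IndexToStringSketch {
  // Labels in index order for the running example: 0 -> a, 1 -> c, 2 -> b.
  val labels: Array[String] = Array("a", "c", "b")

  // Forward direction (what StringIndexer does): label -> index.
  def toIndex(label: String): Double = labels.indexOf(label).toDouble

  // Reverse direction (what IndexToString does): index -> label.
  def toLabel(index: Double): String = labels(index.toInt)
}
```

Applying `toLabel` to the result of `toIndex` returns the original label for every category in the example.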

Code:

    import org.apache.log4j.{Level, Logger}
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.ml.feature.{IndexToString, StringIndexer}
    import org.apache.spark.sql.SparkSession

    object FeatureTransform01 {
      def main(args: Array[String]): Unit = {
        Logger.getLogger("org.apache.spark").setLevel(Level.WARN)

        val conf = new SparkConf().setAppName("FeatureTransform01").setMaster("local")
        val sc = new SparkContext(conf)
        val spark = SparkSession
          .builder()
          .appName("Feature Extraction")
          .config("spark.some.config.option", "some-value")
          .getOrCreate()

        // Create a DataFrame
        val df = spark.createDataFrame(Seq(
          (0, "a"),
          (1, "b"),
          (2, "c"),
          (3, "a"),
          (4, "a"),
          (5, "c")
        )).toDF("id", "category")

        // Convert string labels to index labels
        val indexer = new StringIndexer()
          .setInputCol("category")
          .setOutputCol("categoryIndex")
          .fit(df)
        val indexed = indexer.transform(df)

        println(s"Transformed string column '${indexer.getInputCol}' " +
          s"to indexed column '${indexer.getOutputCol}'")
        indexed.show()

        // Convert index labels back to strings, using the labels stored in the column metadata
        val converter = new IndexToString()
          .setInputCol("categoryIndex")
          .setOutputCol("originalCategory")
        val converted = converter.transform(indexed)

        println(s"Transformed indexed column '${converter.getInputCol}' back to original string " +
          s"column '${converter.getOutputCol}' using labels in metadata")
        converted.select("id", "categoryIndex", "originalCategory").show()

        sc.stop()
      }
    }

Output:

    Transformed string column 'category' to indexed column 'categoryIndex'
    +---+--------+-------------+
    | id|category|categoryIndex|
    +---+--------+-------------+
    |  0|       a|          0.0|
    |  1|       b|          2.0|
    |  2|       c|          1.0|
    |  3|       a|          0.0|
    |  4|       a|          0.0|
    |  5|       c|          1.0|
    +---+--------+-------------+

    Transformed indexed column 'categoryIndex' back to original string column 'originalCategory' using labels in metadata
    +---+-------------+----------------+
    | id|categoryIndex|originalCategory|
    +---+-------------+----------------+
    |  0|          0.0|               a|
    |  1|          2.0|               b|
    |  2|          1.0|               c|
    |  3|          0.0|               a|
    |  4|          0.0|               a|
    |  5|          1.0|               c|
    +---+-------------+----------------+

3.VectorIndexer

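Given a column of feature vectors, VectorIndexer uses the maxCategories parameter to detect automatically which features are categorical, then converts the values of those features into category indices. The output is a new feature vector in which every categorical feature is index-encoded.

Before training classifiers such as decision trees, string category labels need to be converted to index labels.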
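The detection rule can be sketched in plain Scala: a feature position counts as categorical when it has at most maxCategories distinct values. This is a simplified illustration, not Spark's implementation; the object `VectorIndexerSketch`, its functions, and the choice to number category values in ascending order are assumptions of the sketch, and Spark's own ordering of category values may differ.

```scala
object VectorIndexerSketch {
  // For each feature position, collect its distinct values. A feature is treated as
  // categorical when it has at most maxCategories distinct values; each value is then
  // numbered in ascending order (an assumption of this sketch).
  def categoryMaps(rows: Seq[Array[Double]], maxCategories: Int): Map[Int, Map[Double, Int]] = {
    val numFeatures = rows.head.length
    (0 until numFeatures).flatMap { j =>
      val distinct = rows.map(_(j)).distinct.sorted
      if (distinct.size <= maxCategories) Some(j -> distinct.zipWithIndex.toMap) else None
    }.toMap
  }

  // Replace categorical feature values by their index; leave continuous features unchanged.
  def transform(rows: Seq[Array[Double]], maps: Map[Int, Map[Double, Int]]): Seq[Array[Double]] =
    rows.map(_.zipWithIndex.map { case (v, j) =>
      maps.get(j).map(m => m(v).toDouble).getOrElse(v)
    })
}
```

With `maxCategories = 2`, a feature taking only the values 1.0 and 5.0 is indexed as 0.0 and 1.0, while a feature with four distinct values is left as it is.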

The code is as follows:

    import org.apache.spark.ml.feature.VectorIndexer

    // Load the data (assumes an existing SparkSession named `spark`)
    val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

    // Set maxCategories to 10: features with at most 10 distinct values
    // are treated as categorical and converted to indices
    val indexer = new VectorIndexer()
      .setInputCol("features")
      .setOutputCol("indexed")
      .setMaxCategories(10)

    val indexerModel = indexer.fit(data)

    val categoricalFeatures: Set[Int] = indexerModel.categoryMaps.keys.toSet
    println(s"Chose ${categoricalFeatures.size} categorical features: " +
      categoricalFeatures.mkString(", "))

    // Transform
    val indexedData = indexerModel.transform(data)
    indexedData.show()

4.OneHotEncoder

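One-hot encoding is needed in many places. It converts a single categorical column into multiple binary columns.
Logistic regression, for example, needs its categorical features to be one-hot encoded.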

For example:

    a
    b
    c

is converted to:

    a  1 0 0
    b  0 1 0
    c  0 0 1

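Code:

    import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

    // Create a set of category labels (assumes an existing SparkSession named `spark`)
    val df = spark.createDataFrame(Seq(
      (0, "a"),
      (1, "b"),
      (2, "c"),
      (3, "a"),
      (4, "a"),
      (5, "c"),
      (6, "d")
    )).toDF("id", "category")

    // Convert string categories to index labels
    val indexer = new StringIndexer()
      .setInputCol("category")
      .setOutputCol("categoryIndex")
      .fit(df)
    val indexed = indexer.transform(df)

    // One-hot encode the index labels.
    // In Spark 2.x OneHotEncoder is a Transformer; in Spark 3.x it is an
    // Estimator, so use encoder.fit(indexed).transform(indexed) there.
    val encoder = new OneHotEncoder()
      .setInputCol("categoryIndex")
      .setOutputCol("categoryVec")
    val encoded = encoder.transform(indexed)
    encoded.show()

    // `sc` is the SparkContext created in main, as in the IndexToString example
    sc.stop()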

Output:

    +---+--------+-------------+-------------+
    | id|category|categoryIndex|  categoryVec|
    +---+--------+-------------+-------------+
    |  0|       a|          0.0|(3,[0],[1.0])|
    |  1|       b|          3.0|    (3,[],[])|
    |  2|       c|          1.0|(3,[1],[1.0])|
    |  3|       a|          0.0|(3,[0],[1.0])|
    |  4|       a|          0.0|(3,[0],[1.0])|
    |  5|       c|          1.0|(3,[1],[1.0])|
    |  6|       d|          2.0|(3,[2],[1.0])|
    +---+--------+-------------+-------------+

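The fourth column uses the sparse vector representation.
(3,[0],[1.0]) means a vector of length 3 whose value at index 0 is 1.0 and whose other positions are all 0. The length is 3 rather than 4 because OneHotEncoder drops the last category by default (dropLast = true), so the category with the highest index (b, index 3.0) is encoded as the all-zero vector (3,[],[]).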
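To make the notation concrete, here is a small plain-Scala sketch that produces the same (size,[indices],[values]) strings. It only illustrates the encoding rule with the last category dropped by default (Spark's OneHotEncoder has a dropLast parameter that defaults to true); the object `OneHotSketch`, the `Sparse` case class, and `encode` are illustrative names, not Spark classes.

```scala
object OneHotSketch {
  // Sparse vector as (size, indices, values), mirroring the (3,[0],[1.0]) notation.
  final case class Sparse(size: Int, indices: Seq[Int], values: Seq[Double]) {
    override def toString: String =
      s"($size,[${indices.mkString(",")}],[${values.mkString(",")}])"
  }

  // With dropLast = true the vector has numCategories - 1 slots, and the
  // last category is represented by the all-zero vector.
  def encode(index: Int, numCategories: Int, dropLast: Boolean = true): Sparse = {
    val size = if (dropLast) numCategories - 1 else numCategories
    if (index < size) Sparse(size, Seq(index), Seq(1.0))
    else Sparse(size, Seq.empty, Seq.empty)
  }
}
```

With four categories, `encode(0, 4)` prints as (3,[0],[1.0]) and `encode(3, 4)` prints as (3,[],[]), matching the rows for a and b in the output.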

5.VectorAssembler

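Selects several feature columns and combines them into a single feature vector.

For example, here are three kinds of features (hour, mobile, and userFeatures):

 id | hour | mobile | userFeatures     | clicked
----|------|--------|------------------|---------
 0  | 18   | 1.0    | [0.0, 10.0, 0.5] | 1.0

To give the model a convenient input format, the three features are combined into one feature vector stored in a single column:

 id | hour | mobile | userFeatures     | clicked | features
----|------|--------|------------------|---------|-----------------------------
 0  | 18   | 1.0    | [0.0, 10.0, 0.5] | 1.0     | [18.0, 1.0, 0.0, 10.0, 0.5]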
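Conceptually the assembly is a flattening step: scalar columns contribute one entry each and vector columns contribute all of their entries, in the order the columns are listed. A plain-Scala sketch under that reading (the object `AssembleSketch` and the `Either`-based column model are assumptions made for illustration, not how Spark represents columns):

```scala
object AssembleSketch {
  // A column value is either a single number (Left) or an already-built vector (Right).
  def assemble(columns: Seq[Either[Double, Seq[Double]]]): Seq[Double] =
    columns.flatMap {
      case Left(scalar)  => Seq(scalar)
      case Right(vector) => vector
    }
}
```

For the row above, assembling hour = 18, mobile = 1.0, and userFeatures = [0.0, 10.0, 0.5] gives [18.0, 1.0, 0.0, 10.0, 0.5].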

Code:

    import org.apache.spark.ml.feature.VectorAssembler
    import org.apache.spark.ml.linalg.Vectors

    // Create a DataFrame with three features (assumes an existing SparkSession named `spark`)
    val dataset = spark.createDataFrame(
      Seq((0, 18, 1.0, Vectors.dense(0.0, 10.0, 0.5), 1.0))
    ).toDF("id", "hour", "mobile", "userFeatures", "clicked")

    // Merge the three features into one feature vector
    val assembler = new VectorAssembler()
      .setInputCols(Array("hour", "mobile", "userFeatures"))
      .setOutputCol("features")

    val output = assembler.transform(dataset)
    println("Assembled columns 'hour', 'mobile', 'userFeatures' to vector column 'features'")
    output.show(false)

Output:

    +---+----+------+--------------+-------+-----------------------+
    |id |hour|mobile|userFeatures  |clicked|features               |
    +---+----+------+--------------+-------+-----------------------+
    |0  |18  |1.0   |[0.0,10.0,0.5]|1.0    |[18.0,1.0,0.0,10.0,0.5]|
    +---+----+------+--------------+-------+-----------------------+

7.SQLTransformer

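SQL statements can be used to extract or recombine features. Currently (version 2.1.0) only syntax of the form "SELECT … FROM __THIS__ …" is supported, where "__THIS__" stands for the underlying table of the input dataset.

For example:

    SELECT a, a + b AS a_b FROM __THIS__
    SELECT a, SQRT(b) AS b_sqrt FROM __THIS__ where a > 5
    SELECT a, b, SUM(c) AS c_sum FROM __THIS__ GROUP BY a, b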

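Code:

    import org.apache.spark.ml.feature.SQLTransformer

    // Assumes an existing SparkSession named `spark`.
    val df = spark.createDataFrame(
      Seq((0, 1.0, 3.0), (2, 2.0, 5.0))
    ).toDF("id", "v1", "v2")

    // __THIS__ is the placeholder for the input DataFrame
    val sqlTrans = new SQLTransformer().setStatement(
      "SELECT *, (v1 + v2) AS v3, (v1 * v2) AS v4 FROM __THIS__")

    sqlTrans.transform(df).show()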

The code above converts:

 id | v1  | v2
----|-----|-----
 0  | 1.0 | 3.0
 2  | 2.0 | 5.0

into:

 id | v1  | v2  | v3  | v4
----|-----|-----|-----|------
 0  | 1.0 | 3.0 | 4.0 | 3.0
 2  | 2.0 | 5.0 | 7.0 | 10.0
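The two derived columns are simply v1 + v2 and v1 * v2 computed per row. The same computation written as a plain Scala map makes the table easy to check by hand; the object `SqlStatementSketch` and the `Row` and `Out` case classes are hypothetical names for this sketch, not Spark types.

```scala
object SqlStatementSketch {
  final case class Row(id: Int, v1: Double, v2: Double)
  final case class Out(id: Int, v1: Double, v2: Double, v3: Double, v4: Double)

  // Row-wise equivalent of: SELECT *, (v1 + v2) AS v3, (v1 * v2) AS v4 FROM __THIS__
  def transform(rows: Seq[Row]): Seq[Out] =
    rows.map(r => Out(r.id, r.v1, r.v2, r.v1 + r.v2, r.v1 * r.v2))
}
```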

0 0