Source: http://blog.csdn.net/liulingyuan6/article/details/53413728
VectorSlicer
Algorithm overview:
VectorSlicer is a transformer that takes a feature vector as input and outputs a new feature vector that is a sub-array of the original features. It selects values from a vector column using a set of specified indices, producing a new vector column. Two types of indices are accepted:
1. Integer indices, set via setIndices().
2. String indices representing feature names in the vector, set via setNames(). This requires the vector column to have an AttributeGroup, because the transformer matches on the name field of each Attribute.
Either integer or string indices may be used, and the two may also be combined. Duplicate features are not allowed, so the selected indices and names must not overlap. Note that if feature names are used, an exception is thrown when an empty input attribute is encountered.
The output vector lists the features selected by index first (in the order given), followed by the features selected by name (in the order given).
Example:
Suppose we have a DataFrame with a userFeatures column:
userFeatures
------------------
[0.0, 10.0, 0.5]
userFeatures is a vector column that contains three user features. Suppose the first feature of userFeatures is always zero, so we want to remove it and select only the last two features. We can select them with setIndices(1, 2), producing a new features column:
userFeatures | features
------------------|-----------------------------
[0.0, 10.0, 0.5] | [10.0, 0.5]
Suppose we also have potential attributes ["f1", "f2", "f3"]; then we can use setNames("f2", "f3") to select by name:
userFeatures | features
------------------|-----------------------------
[0.0, 10.0, 0.5] | [10.0, 0.5]
["f1", "f2","f3"] | ["f2", "f3"]
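Outside of Spark, the selection rule described above (selected indices first, in input order, then selected names, in input order) can be sketched in a few lines of plain Python. The slice_vector helper below is hypothetical, not part of any Spark API; it only illustrates the semantics.

```python
# Pure-Python sketch of VectorSlicer's selection rule (illustration only,
# not part of the Spark API): integer indices are taken first, in the order
# given, followed by features resolved from names via the attribute names.

def slice_vector(vector, attr_names, indices=(), names=()):
    """Select features by position and by attribute name; indices come first."""
    name_to_index = {n: i for i, n in enumerate(attr_names)}
    picked = list(indices) + [name_to_index[n] for n in names]
    if len(set(picked)) != len(picked):
        raise ValueError("duplicate features selected")  # duplicates are not allowed
    return [vector[i] for i in picked]

# Reproduces the example rows above:
print(slice_vector([0.0, 10.0, 0.5], ["f1", "f2", "f3"], indices=[1, 2]))     # [10.0, 0.5]
print(slice_vector([0.0, 10.0, 0.5], ["f1", "f2", "f3"], names=["f2", "f3"]))  # [10.0, 0.5]
```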
Code examples:
Scala:
- import java.util.Arrays
-
- import org.apache.spark.ml.attribute.{Attribute, AttributeGroup, NumericAttribute}
- import org.apache.spark.ml.feature.VectorSlicer
- import org.apache.spark.ml.linalg.Vectors
- import org.apache.spark.sql.Row
- import org.apache.spark.sql.types.StructType
-
- val data = Arrays.asList(Row(Vectors.dense(-2.0, 2.3, 0.0)))
-
- val defaultAttr = NumericAttribute.defaultAttr
- val attrs = Array("f1", "f2", "f3").map(defaultAttr.withName)
- val attrGroup = new AttributeGroup("userFeatures", attrs.asInstanceOf[Array[Attribute]])
-
- val dataset = spark.createDataFrame(data, StructType(Array(attrGroup.toStructField())))
-
- val slicer = new VectorSlicer().setInputCol("userFeatures").setOutputCol("features")
-
- slicer.setIndices(Array(1)).setNames(Array("f3"))
- // or slicer.setIndices(Array(1, 2)), or slicer.setNames(Array("f2", "f3"))
-
- val output = slicer.transform(dataset)
- println(output.select("userFeatures", "features").first())
Java:
- import java.util.List;
-
- import com.google.common.collect.Lists;
-
- import org.apache.spark.ml.attribute.Attribute;
- import org.apache.spark.ml.attribute.AttributeGroup;
- import org.apache.spark.ml.attribute.NumericAttribute;
- import org.apache.spark.ml.feature.VectorSlicer;
- import org.apache.spark.ml.linalg.Vectors;
- import org.apache.spark.sql.Dataset;
- import org.apache.spark.sql.Row;
- import org.apache.spark.sql.RowFactory;
- import org.apache.spark.sql.types.*;
-
- Attribute[] attrs = new Attribute[]{
- NumericAttribute.defaultAttr().withName("f1"),
- NumericAttribute.defaultAttr().withName("f2"),
- NumericAttribute.defaultAttr().withName("f3")
- };
- AttributeGroup group = new AttributeGroup("userFeatures", attrs);
-
- List<Row> data = Lists.newArrayList(
- RowFactory.create(Vectors.sparse(3, new int[]{0, 1}, new double[]{-2.0, 2.3})),
- RowFactory.create(Vectors.dense(-2.0, 2.3, 0.0))
- );
-
- Dataset<Row> dataset =
- spark.createDataFrame(data, (new StructType()).add(group.toStructField()));
-
- VectorSlicer vectorSlicer = new VectorSlicer()
- .setInputCol("userFeatures").setOutputCol("features");
-
- vectorSlicer.setIndices(new int[]{1}).setNames(new String[]{"f3"});
-
- Dataset<Row> output = vectorSlicer.transform(dataset);
-
- System.out.println(output.select("userFeatures", "features").first());
Python:
- from pyspark.ml.feature import VectorSlicer
- from pyspark.ml.linalg import Vectors
- from pyspark.sql.types import Row
-
- df = spark.createDataFrame([
- Row(userFeatures=Vectors.sparse(3, {0: -2.0, 1: 2.3}),),
- Row(userFeatures=Vectors.dense([-2.0, 2.3, 0.0]),)])
-
- slicer = VectorSlicer(inputCol="userFeatures", outputCol="features", indices=[1])
-
- output = slicer.transform(df)
-
- output.select("userFeatures", "features").show()
RFormula
Algorithm overview:
RFormula selects columns specified by an R model formula. It supports a limited subset of the R formula operators, including '~', '.', ':', '+' and '-'. The basic operators are:
1. ~ separates the target from the terms
2. + concatenates terms; "+ 0" means removing the intercept
3. - removes a term; "- 1" also means removing the intercept
4. : interaction (multiplication for numeric values, or binarized categorical values)
5. . all columns except the target
Suppose a and b are two columns:
1. y ~ a + b denotes the model y ~ w0 + w1 * a + w2 * b, where w0 is the intercept and w1 and w2 are coefficients.
2. y ~ a + b + a:b - 1 denotes the model y ~ w1 * a + w2 * b + w3 * a * b, where w1, w2 and w3 are coefficients.
RFormula produces a vector column of features and a double or string column of label. If the label column is of string type, it is first transformed to double with StringIndexer. If the label column does not exist in the DataFrame, an output label column is created from the response variable specified in the formula.
Example:
Suppose we have a DataFrame with columns id, country, hour and clicked:
id | country |hour | clicked
---|---------|------|---------
7 | "US" | 18 | 1.0
8 | "CA" | 12 | 0.0
9 | "NZ" | 15 | 0.0
If we use RFormula with the formula clicked ~ country + hour, indicating that we want to predict clicked based on country and hour, then after the transformation we get the following DataFrame:
id | country |hour | clicked | features | label
---|---------|------|---------|------------------|-------
7 | "US" | 18 | 1.0 | [0.0, 0.0, 18.0] | 1.0
8 | "CA" | 12 | 0.0 | [0.0, 1.0, 12.0] | 0.0
9 | "NZ" | 15 | 0.0 | [1.0, 0.0, 15.0] | 0.0
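The features column in the table above can be reproduced with a small pure-Python sketch of the encoding: the country string is one-hot encoded with one category dropped, and hour is appended as-is. The category order below is chosen by hand to match the table; in Spark, StringIndexer actually determines the order from label frequencies, so this illustrates the shape of the output, not Spark's exact indexing.

```python
# Simplified sketch of the feature encoding behind clicked ~ country + hour
# (illustration only, not Spark code). The category order ("NZ", "CA", "US")
# is hand-picked to match the example table above.

def encode_row(country, hour, categories=("NZ", "CA", "US")):
    """One-hot encode country with the last category dropped, then append hour."""
    dummies = [0.0] * (len(categories) - 1)
    index = categories.index(country)
    if index < len(categories) - 1:  # the dropped category maps to all zeros
        dummies[index] = 1.0
    return dummies + [float(hour)]

print(encode_row("US", 18))  # [0.0, 0.0, 18.0]
print(encode_row("CA", 12))  # [0.0, 1.0, 12.0]
print(encode_row("NZ", 15))  # [1.0, 0.0, 15.0]
```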
Code examples:
Scala:
- import org.apache.spark.ml.feature.RFormula
-
- val dataset = spark.createDataFrame(Seq(
- (7, "US", 18, 1.0),
- (8, "CA", 12, 0.0),
- (9, "NZ", 15, 0.0)
- )).toDF("id", "country", "hour", "clicked")
- val formula = new RFormula()
- .setFormula("clicked ~ country + hour")
- .setFeaturesCol("features")
- .setLabelCol("label")
- val output = formula.fit(dataset).transform(dataset)
- output.select("features", "label").show()
Java:
- import java.util.Arrays;
- import java.util.List;
-
- import org.apache.spark.ml.feature.RFormula;
- import org.apache.spark.sql.Dataset;
- import org.apache.spark.sql.Row;
- import org.apache.spark.sql.RowFactory;
- import org.apache.spark.sql.types.StructField;
- import org.apache.spark.sql.types.StructType;
-
- import static org.apache.spark.sql.types.DataTypes.*;
-
- StructType schema = createStructType(new StructField[]{
- createStructField("id", IntegerType, false),
- createStructField("country", StringType, false),
- createStructField("hour", IntegerType, false),
- createStructField("clicked", DoubleType, false)
- });
-
- List<Row> data = Arrays.asList(
- RowFactory.create(7, "US", 18, 1.0),
- RowFactory.create(8, "CA", 12, 0.0),
- RowFactory.create(9, "NZ", 15, 0.0)
- );
-
- Dataset<Row> dataset = spark.createDataFrame(data, schema);
- RFormula formula = new RFormula()
- .setFormula("clicked ~ country + hour")
- .setFeaturesCol("features")
- .setLabelCol("label");
- Dataset<Row> output = formula.fit(dataset).transform(dataset);
- output.select("features", "label").show();
Python:
- from pyspark.ml.feature import RFormula
-
- dataset = spark.createDataFrame(
- [(7, "US", 18, 1.0),
- (8, "CA", 12, 0.0),
- (9, "NZ", 15, 0.0)],
- ["id", "country", "hour", "clicked"])
- formula = RFormula(
- formula="clicked ~ country + hour",
- featuresCol="features",
- labelCol="label")
- output = formula.fit(dataset).transform(dataset)
- output.select("features", "label").show()
ChiSqSelector
Algorithm overview:
ChiSqSelector stands for Chi-Squared feature selection. It operates on labeled data with categorical features. ChiSqSelector orders features based on a Chi-Squared test of independence between each feature and the class label, and then selects the features that the class label depends on most strongly. This is akin to choosing the features with the most predictive power.
Example:
Suppose we have a DataFrame with columns id, features and clicked, where clicked is the target to be predicted:
id | features | clicked
---|-----------------------|---------
7 | [0.0, 0.0, 18.0, 1.0] | 1.0
8 | [0.0, 1.0, 12.0, 0.0] | 0.0
9 | [1.0, 0.0, 15.0, 0.1] | 0.0
If we use ChiSqSelector with numTopFeatures set to 1, then according to the label clicked, the last column of features is selected as the most useful feature:
id | features | clicked | selectedFeatures
---|-----------------------|---------|------------------
7 | [0.0, 0.0, 18.0, 1.0] | 1.0 | [1.0]
8 | [0.0, 1.0, 12.0, 0.0] | 0.0 | [0.0]
9 | [1.0, 0.0, 15.0, 0.1] | 0.0 | [0.1]
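The ranking behind this selection can be sketched in plain Python: for each feature, build a contingency table against the label and compute the Chi-Squared statistic sum((observed - expected)^2 / expected); features with higher scores depend on the label more strongly. The chi2_score function below is hypothetical, for illustration only, and does not reproduce Spark's exact implementation.

```python
# Pure-Python sketch of the Chi-Squared score that this kind of feature
# selection ranks by (illustration only, not Spark's implementation).

from collections import Counter

def chi2_score(feature_values, labels):
    """Chi-Squared statistic for independence between a categorical feature and the label."""
    n = len(labels)
    joint = Counter(zip(feature_values, labels))          # observed cell counts
    feature_counts = Counter(feature_values)              # row marginals
    label_counts = Counter(labels)                        # column marginals
    score = 0.0
    for f, f_count in feature_counts.items():
        for l, l_count in label_counts.items():
            observed = joint.get((f, l), 0)
            expected = f_count * l_count / n              # count expected under independence
            score += (observed - expected) ** 2 / expected
    return score

# A feature identical to the label scores high; an unrelated one scores 0:
print(chi2_score([1, 1, 0, 0], [1, 1, 0, 0]))  # 4.0
print(chi2_score([1, 0, 1, 0], [1, 1, 0, 0]))  # 0.0
```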
Code examples:
Scala:
- import org.apache.spark.ml.feature.ChiSqSelector
- import org.apache.spark.ml.linalg.Vectors
-
- val data = Seq(
- (7, Vectors.dense(0.0, 0.0, 18.0, 1.0), 1.0),
- (8, Vectors.dense(0.0, 1.0, 12.0, 0.0), 0.0),
- (9, Vectors.dense(1.0, 0.0, 15.0, 0.1), 0.0)
- )
-
- val df = spark.createDataset(data).toDF("id", "features", "clicked")
-
- val selector = new ChiSqSelector()
- .setNumTopFeatures(1)
- .setFeaturesCol("features")
- .setLabelCol("clicked")
- .setOutputCol("selectedFeatures")
-
- val result = selector.fit(df).transform(df)
- result.show()
Java:
- import java.util.Arrays;
- import java.util.List;
-
- import org.apache.spark.ml.feature.ChiSqSelector;
- import org.apache.spark.ml.linalg.VectorUDT;
- import org.apache.spark.ml.linalg.Vectors;
- import org.apache.spark.sql.Dataset;
- import org.apache.spark.sql.Row;
- import org.apache.spark.sql.RowFactory;
- import org.apache.spark.sql.types.DataTypes;
- import org.apache.spark.sql.types.Metadata;
- import org.apache.spark.sql.types.StructField;
- import org.apache.spark.sql.types.StructType;
-
- List<Row> data = Arrays.asList(
- RowFactory.create(7, Vectors.dense(0.0, 0.0, 18.0, 1.0), 1.0),
- RowFactory.create(8, Vectors.dense(0.0, 1.0, 12.0, 0.0), 0.0),
- RowFactory.create(9, Vectors.dense(1.0, 0.0, 15.0, 0.1), 0.0)
- );
- StructType schema = new StructType(new StructField[]{
- new StructField("id", DataTypes.IntegerType, false, Metadata.empty()),
- new StructField("features", new VectorUDT(), false, Metadata.empty()),
- new StructField("clicked", DataTypes.DoubleType, false, Metadata.empty())
- });
-
- Dataset<Row> df = spark.createDataFrame(data, schema);
-
- ChiSqSelector selector = new ChiSqSelector()
- .setNumTopFeatures(1)
- .setFeaturesCol("features")
- .setLabelCol("clicked")
- .setOutputCol("selectedFeatures");
-
- Dataset<Row> result = selector.fit(df).transform(df);
- result.show();
Python:
- from pyspark.ml.feature import ChiSqSelector
- from pyspark.ml.linalg import Vectors
-
- df = spark.createDataFrame([
- (7, Vectors.dense([0.0, 0.0, 18.0, 1.0]), 1.0,),
- (8, Vectors.dense([0.0, 1.0, 12.0, 0.0]), 0.0,),
- (9, Vectors.dense([1.0, 0.0, 15.0, 0.1]), 0.0,)], ["id", "features", "clicked"])
-
- selector = ChiSqSelector(numTopFeatures=1, featuresCol="features",
- outputCol="selectedFeatures", labelCol="clicked")
-
- result = selector.fit(df).transform(df)
- result.show()