Source: http://blog.csdn.net/liulingyuan6/article/details/53413728
VectorSlicer
Algorithm overview:
VectorSlicer is a transformer that takes a feature vector as input and outputs a new feature vector that is a sub-array of the original features. It selects values from a vector column using a set of specified indices, producing a new vector column. Two types of indices are accepted:
1. Integer indices, set via setIndices().
2. String indices representing feature names in the vector, set via setNames(). This requires the vector column to have an AttributeGroup, because the transformer matches on the name field of each Attribute.
Either integer or string indices may be used, and the two may also be combined. Duplicate features are not allowed, so the selected indices and names must not overlap. Note that if feature names are used, an exception is thrown when an empty input attribute is encountered.
The output vector lists the features selected by index first (in the order given), followed by the features selected by name (in the order given).
Example:
Suppose we have a DataFrame with a userFeatures column:
userFeatures
------------------
[0.0, 10.0, 0.5]
userFeatures is a vector column that contains three user features. Suppose the first feature of userFeatures is always zero, so we want to remove it and select only the last two features. We can select them with setIndices(1, 2), producing a new features column:
userFeatures | features
------------------|-----------------------------
[0.0, 10.0, 0.5] | [10.0, 0.5]
Suppose we also have potential attributes ["f1", "f2", "f3"]; then we can use setNames("f2", "f3") to select by name:
userFeatures | features
------------------|-----------------------------
[0.0, 10.0, 0.5] | [10.0, 0.5]
["f1", "f2","f3"] | ["f2", "f3"]
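Outside of Spark, the selection rule described above (selected indices first, in input order, then selected names, in input order) can be sketched in a few lines of plain Python. The slice_vector helper below is hypothetical, not part of any Spark API; it only illustrates the semantics.

```python
# Pure-Python sketch of VectorSlicer's selection rule (illustration only,
# not part of the Spark API): integer indices are taken first, in the order
# given, followed by features resolved from names via the attribute names.

def slice_vector(vector, attr_names, indices=(), names=()):
    """Select features by position and by attribute name; indices come first."""
    name_to_index = {n: i for i, n in enumerate(attr_names)}
    picked = list(indices) + [name_to_index[n] for n in names]
    if len(set(picked)) != len(picked):
        raise ValueError("duplicate features selected")  # duplicates are not allowed
    return [vector[i] for i in picked]

# Reproduces the example rows above:
print(slice_vector([0.0, 10.0, 0.5], ["f1", "f2", "f3"], indices=[1, 2]))     # [10.0, 0.5]
print(slice_vector([0.0, 10.0, 0.5], ["f1", "f2", "f3"], names=["f2", "f3"]))  # [10.0, 0.5]
```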
Code examples:
Scala:
- import java.util.Arrays
-
- import org.apache.spark.ml.attribute.{Attribute, AttributeGroup, NumericAttribute}
- import org.apache.spark.ml.feature.VectorSlicer
- import org.apache.spark.ml.linalg.Vectors
- import org.apache.spark.sql.Row
- import org.apache.spark.sql.types.StructType
-
- val data = Arrays.asList(Row(Vectors.dense(-2.0, 2.3, 0.0)))
-
- val defaultAttr = NumericAttribute.defaultAttr
- val attrs = Array("f1", "f2", "f3").map(defaultAttr.withName)
- val attrGroup = new AttributeGroup("userFeatures", attrs.asInstanceOf[Array[Attribute]])
-
- val dataset = spark.createDataFrame(data, StructType(Array(attrGroup.toStructField())))
-
- val slicer = new VectorSlicer().setInputCol("userFeatures").setOutputCol("features")
-
- slicer.setIndices(Array(1)).setNames(Array("f3"))
- // or slicer.setIndices(Array(1, 2)), or slicer.setNames(Array("f2", "f3"))
-
- val output = slicer.transform(dataset)
- println(output.select("userFeatures", "features").first())
Java:
- import java.util.List;
-
- import com.google.common.collect.Lists;
-
- import org.apache.spark.ml.attribute.Attribute;
- import org.apache.spark.ml.attribute.AttributeGroup;
- import org.apache.spark.ml.attribute.NumericAttribute;
- import org.apache.spark.ml.feature.VectorSlicer;
- import org.apache.spark.ml.linalg.Vectors;
- import org.apache.spark.sql.Dataset;
- import org.apache.spark.sql.Row;
- import org.apache.spark.sql.RowFactory;
- import org.apache.spark.sql.types.*;
-
- Attribute[] attrs = new Attribute[]{
- NumericAttribute.defaultAttr().withName("f1"),
- NumericAttribute.defaultAttr().withName("f2"),
- NumericAttribute.defaultAttr().withName("f3")
- };
- AttributeGroup group = new AttributeGroup("userFeatures", attrs);
-
- List<Row> data = Lists.newArrayList(
- RowFactory.create(Vectors.sparse(3, new int[]{0, 1}, new double[]{-2.0, 2.3})),
- RowFactory.create(Vectors.dense(-2.0, 2.3, 0.0))
- );
-
- Dataset<Row> dataset =
- spark.createDataFrame(data, (new StructType()).add(group.toStructField()));
-
- VectorSlicer vectorSlicer = new VectorSlicer()
- .setInputCol("userFeatures").setOutputCol("features");
-
- vectorSlicer.setIndices(new int[]{1}).setNames(new String[]{"f3"});
-
- Dataset<Row> output = vectorSlicer.transform(dataset);
-
- System.out.println(output.select("userFeatures", "features").first());
Python:
- from pyspark.ml.feature import VectorSlicer
- from pyspark.ml.linalg import Vectors
- from pyspark.sql.types import Row
-
- df = spark.createDataFrame([
- Row(userFeatures=Vectors.sparse(3, {0: -2.0, 1: 2.3}),),
- Row(userFeatures=Vectors.dense([-2.0, 2.3, 0.0]),)])
-
- slicer = VectorSlicer(inputCol="userFeatures", outputCol="features", indices=[1])
-
- output = slicer.transform(df)
-
- output.select("userFeatures", "features").show()
RFormula
Algorithm overview:
RFormula selects columns specified by an R model formula. It supports a limited subset of the R formula operators, including '~', '.', ':', '+' and '-'. The basic operators are:
1. ~ separates the target from the terms
2. + concatenates terms; "+ 0" means removing the intercept
3. - removes a term; "- 1" also means removing the intercept
4. : interaction (multiplication for numeric values, or binarized categorical values)
5. . all columns except the target
Suppose a and b are two columns:
1. y ~ a + b denotes the model y ~ w0 + w1 * a + w2 * b, where w0 is the intercept and w1 and w2 are coefficients.
2. y ~ a + b + a:b - 1 denotes the model y ~ w1 * a + w2 * b + w3 * a * b, where w1, w2 and w3 are coefficients.
RFormula produces a vector column of features and a double or string column of label. If the label column is of string type, it is first transformed to double with StringIndexer. If the label column does not exist in the DataFrame, an output label column is created from the response variable specified in the formula.
Example:
Suppose we have a DataFrame with columns id, country, hour and clicked:
id | country |hour | clicked
---|---------|------|---------
7 | "US" | 18 | 1.0
8 | "CA" | 12 | 0.0
9 | "NZ" | 15 | 0.0
If we use RFormula with the formula clicked ~ country + hour, indicating that we want to predict clicked based on country and hour, then after the transformation we get the following DataFrame:
id | country |hour | clicked | features | label
---|---------|------|---------|------------------|-------
7 | "US" | 18 | 1.0 | [0.0, 0.0, 18.0] | 1.0
8 | "CA" | 12 | 0.0 | [0.0, 1.0, 12.0] | 0.0
9 | "NZ" | 15 | 0.0 | [1.0, 0.0, 15.0] | 0.0
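The features column in the table above can be reproduced with a small pure-Python sketch of the encoding: the country string is one-hot encoded with one category dropped, and hour is appended as-is. The category order below is chosen by hand to match the table; in Spark, StringIndexer actually determines the order from label frequencies, so this illustrates the shape of the output, not Spark's exact indexing.

```python
# Simplified sketch of the feature encoding behind clicked ~ country + hour
# (illustration only, not Spark code). The category order ("NZ", "CA", "US")
# is hand-picked to match the example table above.

def encode_row(country, hour, categories=("NZ", "CA", "US")):
    """One-hot encode country with the last category dropped, then append hour."""
    dummies = [0.0] * (len(categories) - 1)
    index = categories.index(country)
    if index < len(categories) - 1:  # the dropped category maps to all zeros
        dummies[index] = 1.0
    return dummies + [float(hour)]

print(encode_row("US", 18))  # [0.0, 0.0, 18.0]
print(encode_row("CA", 12))  # [0.0, 1.0, 12.0]
print(encode_row("NZ", 15))  # [1.0, 0.0, 15.0]
```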
Code examples:
Scala:
- import org.apache.spark.ml.feature.RFormula
-
- val dataset = spark.createDataFrame(Seq(
- (7, "US", 18, 1.0),
- (8, "CA", 12, 0.0),
- (9, "NZ", 15, 0.0)
- )).toDF("id", "country", "hour", "clicked")
- val formula = new RFormula()
- .setFormula("clicked ~ country + hour")
- .setFeaturesCol("features")
- .setLabelCol("label")
- val output = formula.fit(dataset).transform(dataset)
- output.select("features", "label").show()
Java:
- import java.util.Arrays;
- import java.util.List;
-
- import org.apache.spark.ml.feature.RFormula;
- import org.apache.spark.sql.Dataset;
- import org.apache.spark.sql.Row;
- import org.apache.spark.sql.RowFactory;
- import org.apache.spark.sql.types.StructField;
- import org.apache.spark.sql.types.StructType;
-
- import static org.apache.spark.sql.types.DataTypes.*;
-
- StructType schema = createStructType(new StructField[]{
- createStructField("id", IntegerType, false),
- createStructField("country", StringType, false),
- createStructField("hour", IntegerType, false),
- createStructField("clicked", DoubleType, false)
- });
-
- List<Row> data = Arrays.asList(
- RowFactory.create(7, "US", 18, 1.0),
- RowFactory.create(8, "CA", 12, 0.0),
- RowFactory.create(9, "NZ", 15, 0.0)
- );
-
- Dataset<Row> dataset = spark.createDataFrame(data, schema);
- RFormula formula = new RFormula()
- .setFormula("clicked ~ country + hour")
- .setFeaturesCol("features")
- .setLabelCol("label");
- Dataset<Row> output = formula.fit(dataset).transform(dataset);
- output.select("features", "label").show();
Python:
- from pyspark.ml.feature import RFormula
-
- dataset = spark.createDataFrame(
- [(7, "US", 18, 1.0),
- (8, "CA", 12, 0.0),
- (9, "NZ", 15, 0.0)],
- ["id", "country", "hour", "clicked"])
- formula = RFormula(
- formula="clicked ~ country + hour",
- featuresCol="features",
- labelCol="label")
- output = formula.fit(dataset).transform(dataset)
- output.select("features", "label").show()
ChiSqSelector
Algorithm overview:
ChiSqSelector stands for Chi-Squared feature selection. It operates on labeled data with categorical features. ChiSqSelector orders features based on a Chi-Squared test of independence between each feature and the class label, and then selects the features that the class label depends on most strongly. This is akin to choosing the features with the most predictive power.
Example:
Suppose we have a DataFrame with columns id, features and clicked, where clicked is the target to be predicted:
id | features | clicked
---|-----------------------|---------
7 | [0.0, 0.0, 18.0, 1.0] | 1.0
8 | [0.0, 1.0, 12.0, 0.0] | 0.0
9 | [1.0, 0.0, 15.0, 0.1] | 0.0
If we use ChiSqSelector with numTopFeatures set to 1, then according to the label clicked, the last column of features is selected as the most useful feature:
id | features | clicked | selectedFeatures
---|-----------------------|---------|------------------
7 | [0.0, 0.0, 18.0, 1.0] | 1.0 | [1.0]
8 | [0.0, 1.0, 12.0, 0.0] | 0.0 | [0.0]
9 | [1.0, 0.0, 15.0, 0.1] | 0.0 | [0.1]
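The ranking behind this selection can be sketched in plain Python: for each feature, build a contingency table against the label and compute the Chi-Squared statistic sum((observed - expected)^2 / expected); features with higher scores depend on the label more strongly. The chi2_score function below is hypothetical, for illustration only, and does not reproduce Spark's exact implementation.

```python
# Pure-Python sketch of the Chi-Squared score that this kind of feature
# selection ranks by (illustration only, not Spark's implementation).

from collections import Counter

def chi2_score(feature_values, labels):
    """Chi-Squared statistic for independence between a categorical feature and the label."""
    n = len(labels)
    joint = Counter(zip(feature_values, labels))          # observed cell counts
    feature_counts = Counter(feature_values)              # row marginals
    label_counts = Counter(labels)                        # column marginals
    score = 0.0
    for f, f_count in feature_counts.items():
        for l, l_count in label_counts.items():
            observed = joint.get((f, l), 0)
            expected = f_count * l_count / n              # count expected under independence
            score += (observed - expected) ** 2 / expected
    return score

# A feature identical to the label scores high; an unrelated one scores 0:
print(chi2_score([1, 1, 0, 0], [1, 1, 0, 0]))  # 4.0
print(chi2_score([1, 0, 1, 0], [1, 1, 0, 0]))  # 0.0
```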
Code examples:
Scala:
- import org.apache.spark.ml.feature.ChiSqSelector
- import org.apache.spark.ml.linalg.Vectors
-
- val data = Seq(
- (7, Vectors.dense(0.0, 0.0, 18.0, 1.0), 1.0),
- (8, Vectors.dense(0.0, 1.0, 12.0, 0.0), 0.0),
- (9, Vectors.dense(1.0, 0.0, 15.0, 0.1), 0.0)
- )
-
- val df = spark.createDataset(data).toDF("id", "features", "clicked")
-
- val selector = new ChiSqSelector()
- .setNumTopFeatures(1)
- .setFeaturesCol("features")
- .setLabelCol("clicked")
- .setOutputCol("selectedFeatures")
-
- val result = selector.fit(df).transform(df)
- result.show()
Java:
- import java.util.Arrays;
- import java.util.List;
-
- import org.apache.spark.ml.feature.ChiSqSelector;
- import org.apache.spark.ml.linalg.VectorUDT;
- import org.apache.spark.ml.linalg.Vectors;
- import org.apache.spark.sql.Dataset;
- import org.apache.spark.sql.Row;
- import org.apache.spark.sql.RowFactory;
- import org.apache.spark.sql.types.DataTypes;
- import org.apache.spark.sql.types.Metadata;
- import org.apache.spark.sql.types.StructField;
- import org.apache.spark.sql.types.StructType;
-
- List<Row> data = Arrays.asList(
- RowFactory.create(7, Vectors.dense(0.0, 0.0, 18.0, 1.0), 1.0),
- RowFactory.create(8, Vectors.dense(0.0, 1.0, 12.0, 0.0), 0.0),
- RowFactory.create(9, Vectors.dense(1.0, 0.0, 15.0, 0.1), 0.0)
- );
- StructType schema = new StructType(new StructField[]{
- new StructField("id", DataTypes.IntegerType, false, Metadata.empty()),
- new StructField("features", new VectorUDT(), false, Metadata.empty()),
- new StructField("clicked", DataTypes.DoubleType, false, Metadata.empty())
- });
-
- Dataset<Row> df = spark.createDataFrame(data, schema);
-
- ChiSqSelector selector = new ChiSqSelector()
- .setNumTopFeatures(1)
- .setFeaturesCol("features")
- .setLabelCol("clicked")
- .setOutputCol("selectedFeatures");
-
- Dataset<Row> result = selector.fit(df).transform(df);
- result.show();
Python:
- from pyspark.ml.feature import ChiSqSelector
- from pyspark.ml.linalg import Vectors
-
- df = spark.createDataFrame([
- (7, Vectors.dense([0.0, 0.0, 18.0, 1.0]), 1.0,),
- (8, Vectors.dense([0.0, 1.0, 12.0, 0.0]), 0.0,),
- (9, Vectors.dense([1.0, 0.0, 15.0, 0.1]), 0.0,)], ["id", "features", "clicked"])
-
- selector = ChiSqSelector(numTopFeatures=1, featuresCol="features",
- outputCol="selectedFeatures", labelCol="clicked")
-
- result = selector.fit(df).transform(df)
- result.show()