[Sklearn Application 2] Preprocessing data (2): Binarization

Source: Internet | Published by: 多得美工学院b段班 | Editor: 程序博客网 | Time: 2024/06/03 22:18

This material is covered in the official sklearn documentation: http://scikit-learn.org/stable/modules/preprocessing.html#
sklearn version: 0.18.2

Binarization

Feature binarization is the process of thresholding numerical features to get boolean values. This can be useful for downstream probabilistic estimators that make assumption that the input data is distributed according to a multi-variate Bernoulli distribution. For instance, this is the case for the sklearn.neural_network.BernoulliRBM. ——scikit-learn.org
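The quoted point about Bernoulli-distributed inputs can be illustrated with a minimal sketch that thresholds continuous features before passing them to BernoulliRBM. The random data, the 0.5 threshold, the step names, and the RBM settings are assumptions chosen only for illustration; they are not from the original post.

```python
import numpy as np
from sklearn.neural_network import BernoulliRBM
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Binarizer

# Hypothetical continuous features in [0, 1)
rng = np.random.RandomState(0)
X_cont = rng.rand(20, 6)

# Threshold the features to 0/1 first, then fit the RBM on the binary data
model = Pipeline([
    ("binarize", Binarizer(threshold=0.5)),
    ("rbm", BernoulliRBM(n_components=2, n_iter=5, random_state=0)),
])

hidden = model.fit_transform(X_cont)
print(hidden.shape)  # one row per sample, one column per hidden unit
```

Putting the Binarizer inside the pipeline means the same threshold is applied automatically whenever new data is transformed.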

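Binarization discretizes a continuous variable according to a chosen threshold, converting each value to 0 or 1. It has the following advantages:

The result can be stored as a sparse matrix, which saves storage space and speeds up computation.
Missing values (NA) can be handled effectively.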

Sparse matrix: a matrix in which the zero elements far outnumber the nonzero elements, and the nonzero elements follow no regular pattern.
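As a rough illustration of the sparse-matrix point, the sketch below binarizes a SciPy CSR matrix and checks that the result is still sparse. The sample values and the 0.5 threshold are assumptions made for this example. Note that Binarizer rejects a negative threshold for sparse input, because the implicit zeros would then have to become ones.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import Binarizer

# Hypothetical data: mostly zeros, so a sparse format is a natural fit
dense = np.array([[0.0, 0.0, 3.5, 0.0],
                  [0.0, 0.2, 0.0, 0.0],
                  [0.0, 0.0, 0.0, 7.0]])
X_sparse = sparse.csr_matrix(dense)

# Stored values > 0.5 become 1; the rest become 0 and are dropped from storage
X_sparse_bi = Binarizer(threshold=0.5).fit_transform(X_sparse)

print(sparse.issparse(X_sparse_bi))  # the output stays in sparse format
print(X_sparse_bi.nnz)               # number of explicitly stored entries
print(X_sparse_bi.toarray())
```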

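import numpy as np
from sklearn.preprocessing import Binarizer

X = np.array([[1.0, -1.0, 2.0], [2.0, 0.0, 0.0], [0.0, 1.0, -1.0]])  # example data (assumed; X was not defined in the original)

bi = Binarizer(threshold=0)  # threshold: values > threshold become 1, values <= threshold become 0
bi.fit(X)                    # fit does nothing
X_bi = bi.transform(X)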
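To make the threshold rule concrete, here is a small sketch with a nonzero threshold, compared against a plain NumPy comparison and against the functional form sklearn.preprocessing.binarize. The array X_demo and the threshold 1.1 are made up for illustration. One entry equals the threshold exactly, which shows that the comparison is strict.

```python
import numpy as np
from sklearn.preprocessing import Binarizer, binarize

# Hypothetical data; the value 1.1 sits exactly on the threshold
X_demo = np.array([[0.2, 1.5, -3.0],
                   [4.0, 0.0, 1.1]])

# Estimator form: values > 1.1 become 1, values <= 1.1 become 0
X_demo_bi = Binarizer(threshold=1.1).fit_transform(X_demo)

# Equivalent plain NumPy expression
X_demo_np = (X_demo > 1.1).astype(float)

# Equivalent functional form, no estimator object needed
X_demo_fn = binarize(X_demo, threshold=1.1)

print(X_demo_bi)
# [[0. 1. 0.]
#  [1. 0. 0.]]
```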

Binning into multiple groups

pandas.cut()  # the values to be binned must be numeric
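A minimal sketch of multi-group binning with pandas.cut follows. The ages, bin edges, and labels are hypothetical values chosen for this example. By default each bin is closed on the right, so the edges below define the intervals (0, 18], (18, 40], and (40, 100].

```python
import pandas as pd

# Hypothetical numeric data to be grouped
ages = pd.Series([3, 17, 25, 42, 68])

# Bin edges and one label per interval (both assumed for illustration)
bins = [0, 18, 40, 100]
labels = ["young", "adult", "senior"]

groups = pd.cut(ages, bins=bins, labels=labels)

print(groups.tolist())            # ['young', 'young', 'adult', 'senior', 'senior']
print(groups.cat.codes.tolist())  # [0, 0, 1, 2, 2]
```

The integer codes are convenient when a downstream estimator expects numeric input rather than string labels.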
