Data Preprocessing (2): One-Hot Encoding with pandas.get_dummies and sklearn.feature_extraction.DictVectorizer

Source: Internet  Published: 中国产业生产率数据库  Editor: 程序博客网  Time: 2024/06/13 04:06

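Encoding a discrete feature falls into two cases:

1. The values of the discrete feature have no inherent order, e.g. color: [red, blue]. In this case, use one-hot encoding.

2. The values of the discrete feature are ordered, e.g. size: [X, XL, XXL]. In this case, map them to numbers: {X: 1, XL: 2, XXL: 3}.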


In [90]:
 
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
np.set_printoptions(precision=4)
In [91]:
 
df = pd.DataFrame([  
            ['green', 'M', 10.1, 'class1'],   
            ['red', 'L', 13.5, 'class2'],   
            ['blue', 'XL', 15.3, 'class1']])  
df.columns = ['color', 'size', 'prize', 'class label']  
df
Out[91]:
   color size  prize class label
0  green    M   10.1      class1
1    red    L   13.5      class2
2   blue   XL   15.3      class1
In [92]:
 
size_mapping = {  
           'XL': 3,  
           'L': 2,  
           'M': 1}  
df['size'] = df['size'].map(size_mapping) 
df
Out[92]:
   color  size  prize class label
0  green     1   10.1      class1
1    red     2   13.5      class2
2   blue     3   15.3      class1
In [93]:
 
# -----------------------------------------------
# Using pd.get_dummies()
pd.get_dummies(df)
Out[93]:
   size  prize  color_blue  color_green  color_red  class label_class1  class label_class2
0     1   10.1           0            1          0                   1                   0
1     2   13.5           0            0          1                   0                   1
2     3   15.3           1            0          0                   1                   0
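Note that calling pd.get_dummies(df) on the whole frame also encodes the 'class label' column. If only the feature columns should be encoded, the `columns` argument can restrict the encoding. The following is a minimal sketch, assuming a fresh copy of the frame named `df_demo` (a name introduced here for illustration, not part of the original notebook):

```python
import pandas as pd

# Same toy frame as in the post, with 'size' already mapped to integers.
# df_demo is a hypothetical name used only for this sketch.
df_demo = pd.DataFrame(
    [["green", 1, 10.1, "class1"],
     ["red", 2, 13.5, "class2"],
     ["blue", 3, 15.3, "class1"]],
    columns=["color", "size", "prize", "class label"],
)

# Encode only 'color'; 'class label' stays a plain string column.
# dtype=int is passed so the indicators are 0/1 on any pandas version
# (newer releases default to booleans).
features_only = pd.get_dummies(df_demo, columns=["color"], dtype=int)
print(features_only)
```

The untouched columns come first, followed by the new indicator columns.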
In [94]:
 
df
Out[94]:
   color  size  prize class label
0  green     1   10.1      class1
1    red     2   13.5      class2
2   blue     3   15.3      class1
In [95]:
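# -----------------------------------------------
# Using sklearn.feature_extraction.DictVectorizer
# Note: df.ix was removed in pandas 1.0, so .loc / .iloc are used instead.
feature_list = []
label_list = []
for row in df.index:
    # The last column holds the class label
    label_list.append(df.loc[row].iloc[-1])
    rowDict = {}
    # All other columns are features, keyed by column name
    for i in range(len(df.columns) - 1):
        rowDict[df.columns[i]] = df.loc[row].iloc[i]
    feature_list.append(rowDict)
feature_list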
Out[95]:
[{'color': 'green', 'prize': 10.1, 'size': 1},
 {'color': 'red', 'prize': 13.5, 'size': 2},
 {'color': 'blue', 'prize': 15.300000000000001, 'size': 3}]
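The row-by-row loop can probably be replaced by a single pandas call. The sketch below is one possible alternative, not the author's method; it assumes a fresh copy of the frame named `df_demo` (a hypothetical name) and uses `DataFrame.to_dict(orient="records")` to build the list of dicts:

```python
import pandas as pd

# df_demo is a hypothetical stand-in for the notebook's df.
df_demo = pd.DataFrame(
    [["green", 1, 10.1, "class1"],
     ["red", 2, 13.5, "class2"],
     ["blue", 3, 15.3, "class1"]],
    columns=["color", "size", "prize", "class label"],
)

# Drop the label column, then turn each remaining row into a dict.
feature_records = df_demo.drop(columns=["class label"]).to_dict(orient="records")

# The labels are simply the last column as a plain Python list.
label_values = df_demo["class label"].tolist()

print(feature_records)
print(label_values)
```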
In [96]:
 
label_list
Out[96]:
['class1', 'class2', 'class1']
In [97]:
 
from sklearn.feature_extraction import DictVectorizer
vec = DictVectorizer()
# DictVectorizer.fit_transform() takes a list of dicts
dummy_x = vec.fit_transform(feature_list).toarray()
dummy_x
Out[97]:
array([[  0. ,   1. ,   0. ,  10.1,   1. ],
       [  0. ,   0. ,   1. ,  13.5,   2. ],
       [  1. ,   0. ,   0. ,  15.3,   3. ]])
In [98]:
 
from sklearn import preprocessing
label_bin = preprocessing.LabelBinarizer()
# preprocessing.LabelBinarizer.fit_transform() takes a list
dummy_y = label_bin.fit_transform(label_list)
dummy_y
Out[98]:
array([[0],
       [1],
       [0]])
In [99]:
 
# Test: what happens when there are more than two label classes
# (.loc is used because chained assignment, df['class label'][2] = ...,
# triggers SettingWithCopyWarning and may not modify df)
df.loc[2, 'class label'] = 'class3'
df
Out[99]:
   color  size  prize class label
0  green     1   10.1      class1
1    red     2   13.5      class2
2   blue     3   15.3      class3
In [100]:
 
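# Rebuild the feature dicts and labels from the updated frame
# (df.ix was removed in pandas 1.0, so .loc / .iloc are used instead)
feature_list = []
label_list = []
for row in df.index:
    label_list.append(df.loc[row].iloc[-1])
    rowDict = {}
    for i in range(len(df.columns) - 1):
        rowDict[df.columns[i]] = df.loc[row].iloc[i]
    feature_list.append(rowDict)
dummy_y = label_bin.fit_transform(label_list)
dummy_y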
Out[100]:
array([[1, 0, 0],
       [0, 1, 0],
       [0, 0, 1]])
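The two outputs above differ in shape: with two classes LabelBinarizer returns a single 0/1 column, and with three or more it returns one column per class. The sketch below reproduces both cases side by side and shows how the encoding might be mapped back to the original labels; `lb_demo` is a name introduced here for illustration:

```python
from sklearn.preprocessing import LabelBinarizer

lb_demo = LabelBinarizer()

# Binary problem: a single 0/1 column.
two_class = lb_demo.fit_transform(["class1", "class2", "class1"])
print(two_class.shape)

# Three classes: one indicator column per class.
three_class = lb_demo.fit_transform(["class1", "class2", "class3"])
print(lb_demo.classes_)  # column order of the indicator matrix

# inverse_transform recovers the original labels from the encoding.
restored = lb_demo.inverse_transform(three_class)
print(restored)
```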
In [ ]:
 
# Conclusion: the two approaches produce essentially the same result, but pd.get_dummies is somewhat easier to use