pandas factorize() and the NumPy unique() function

Source: Internet | Editor: 程序博客网 | Date: 2024/06/01 15:28

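1. The factorize function maps the nominal (categorical) values in a Series to a set of integers; identical values map to the same integer.

factorize returns a tuple with two elements.

The first element is an array holding the integer code that each nominal value was mapped to;

the second element is an Index holding every distinct nominal value, with no duplicates.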

    # coding=utf-8
    import pandas as pd

    df = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6, 3, 2],
                       "raw_grade": ['a', 'b', 'b', 'a', 'a', 'e', 'c', 'a']})
    print(df)
    print('\n')
    # factorize returns a tuple: (integer codes, distinct values)
    x = pd.factorize(df.raw_grade)
    print(x)

Result (the exact dtype display varies by platform and library version):

       id raw_grade
    0   1         a
    1   2         b
    2   3         b
    3   4         a
    4   5         a
    5   6         e
    6   3         c
    7   2         a


    (array([0, 1, 1, 0, 0, 2, 3, 0], dtype=int64), Index(['a', 'b', 'e', 'c'], dtype='object'))
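The two parts of the tuple can also be unpacked directly. The following is a minimal sketch, not part of the original post: the variable names (`grades`, `codes`, `uniques`, `restored`) are made up for illustration. It shows that the codes index into the distinct values, so the original column can be recovered, and that `sort=True` numbers the values in sorted order instead of order of first appearance.

```python
import pandas as pd

# Same grades as in the example above (hypothetical variable names)
grades = pd.Series(['a', 'b', 'b', 'a', 'a', 'e', 'c', 'a'])

# Unpack the tuple: integer codes and the distinct values
codes, uniques = pd.factorize(grades)

# Each code is a position in `uniques`, so take() recovers the original values
restored = uniques.take(codes)
print(list(restored) == list(grades))

# With sort=True the distinct values are sorted before codes are assigned
codes_sorted, uniques_sorted = pd.factorize(grades, sort=True)
print(codes_sorted, list(uniques_sorted))
```

By default the codes follow the order in which values first appear, which is why 'e' receives code 2 and 'c' receives code 3 in the result above.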

2. A look at the NumPy unique function

unique() returns all distinct values in the input array, sorted in ascending order.
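The sorting is the main visible difference from pandas factorize. As a small sketch (the sample data and variable names are assumptions chosen for illustration), the same input gives distinct values in sorted order from `np.unique` and in order of first appearance from `pd.factorize`:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, deliberately not in sorted order
values = np.array(['b', 'a', 'c', 'a', 'b'])

# np.unique sorts the distinct values
sorted_uniques = np.unique(values)

# pd.factorize keeps the order of first appearance
codes, appearance_uniques = pd.factorize(values)

print(list(sorted_uniques))
print(list(appearance_uniques), list(codes))
```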

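Two of its optional parameters are covered here:

return_index: if True, also return the indices in the original array at which each unique value first occurs;

return_inverse: if True, also return, for each element of the original array or list, its index in the new unique array, so that the original can be rebuilt from it.

1) For a one-dimensional list or array A: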

    import numpy as np

    A = [1, 2, 2, 3, 4, 3]
    a = np.unique(A)
    print(a)            # prints [1 2 3 4]
    a, b, c = np.unique(A, return_index=True, return_inverse=True)
    print(a, b, c)      # prints [1 2 3 4] [0 1 3 4] [0 1 1 2 3 2]

Note the difference between the example above and the one below:

    A = [4, 2, 2, 3, 1, 3]
    a = np.unique(A)
    print(a)            # prints [1 2 3 4]
    a, b, c = np.unique(A, return_index=True, return_inverse=True)
    print(a, b, c)      # prints [1 2 3 4] [4 1 3 0] [3 1 1 2 0 2]

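Explanation:

The elements of the rebuilt list [1 2 3 4] have indices 0, 1, 2, 3. c expresses each element of the original array or list as its index in that rebuilt list, giving [3 1 1 2 0 2]: 4 has index 3, 3 has index 2, 2 has index 1, and 1 has index 0.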
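One way to check this reading of b and c is to index with them. The sketch below reuses the second example; the names `first_occurrences` and `rebuilt` are introduced here only for illustration. Indexing the original array with b yields the unique values, and indexing the unique values with c yields the original array again.

```python
import numpy as np

A = np.array([4, 2, 2, 3, 1, 3])
a, b, c = np.unique(A, return_index=True, return_inverse=True)

# b holds the position of the first occurrence of each unique value in A
first_occurrences = A[b]

# c holds, for each element of A, its position in the unique array a
rebuilt = a[c]

print(np.array_equal(first_occurrences, a))
print(np.array_equal(rebuilt, A))
```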

2) For a two-dimensional array (numeric ndarray type):

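    A = [[1, 2], [3, 4], [5, 6], [1, 2]]
    A = np.array(A)   # a list must be converted to an array first
    # View each row as one structured record so that unique() compares whole rows
    a, b, c = np.unique(A.view(A.dtype.descr * A.shape[1]), return_index=True, return_inverse=True)
    print(a, b, c)    # the original post reports: [(1, 2) (3, 4) (5, 6)] [0 1 2] [0 1 2 0]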
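As an alternative to the structured-view trick, NumPy 1.13 and later accept an `axis` argument in `np.unique`. The following sketch is not from the original post and uses illustrative names (`rows`, `first_idx`, `inverse`, `rebuilt`); it finds the distinct rows directly. The inverse is flattened explicitly because its shape has differed between NumPy releases.

```python
import numpy as np

A = np.array([[1, 2], [3, 4], [5, 6], [1, 2]])

# axis=0 treats each row as one item when looking for duplicates
rows, first_idx, inverse = np.unique(
    A, axis=0, return_index=True, return_inverse=True
)

# Flatten the inverse so it is 1-D regardless of the NumPy version
inverse = inverse.reshape(-1)

# Indexing the distinct rows with the inverse rebuilds the original array
rebuilt = rows[inverse]

print(rows)
print(first_idx, inverse)
```

Here the result is an ordinary two-dimensional array of rows rather than an array of records.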
