Handling string-typed data in Python
Source: Internet. Editor: 程序博客网. Time: 2024/06/16 19:06
1. sklearn
1.1 LabelEncoder
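from sklearn import preprocessing

# df is assumed to be an existing DataFrame with a string column 'Col3'.
le = preprocessing.LabelEncoder()
# Fit on the same column that is transformed; fitting on a different
# column fails as soon as 'Col3' contains a label the encoder has not seen.
le.fit(df['Col3'])
df['Col3'] = le.transform(df['Col3'])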
Another example:
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
le.fit(["paris", "paris", "tokyo", "amsterdam"])
# classes_ holds the unique labels in sorted order: amsterdam, paris, tokyo
print(list(le.classes_))
# Each label is replaced by its index in classes_.
print(le.transform(["tokyo", "tokyo", "paris"]))  # [2 2 1]
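As a complement to the example above, the following minimal sketch shows two related behaviours of LabelEncoder: inverse_transform maps the integer codes back to the original strings, and transform raises ValueError for a label that was not present during fit. The variable names (cities, codes, decoded, unseen_ok) and the label "berlin" are illustrative only.

```python
from sklearn.preprocessing import LabelEncoder

cities = ["paris", "paris", "tokyo", "amsterdam"]  # same toy data as above

le = LabelEncoder()
codes = le.fit_transform(cities)  # fit and transform in one call

# Map the integer codes back to the original strings.
decoded = le.inverse_transform(codes)

# A label that was not seen during fit cannot be transformed.
try:
    le.transform(["berlin"])
    unseen_ok = True
except ValueError:
    unseen_ok = False

print(codes.tolist())    # [1, 1, 2, 0]
print(decoded.tolist())  # ['paris', 'paris', 'tokyo', 'amsterdam']
print(unseen_ok)         # False
```

Because unseen labels raise an error, the encoder is usually fitted on data that covers every label expected at transform time.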
Here this is combined with reading a file, so that the string columns of a spreadsheet are label-encoded.
import pandas as pd
from sklearn import preprocessing


# Return the names of the columns whose dtype matches xtype.
def obtain_x(train_df, xtype):
    dtype_df = train_df.dtypes.reset_index()
    print('dtype_df\n', dtype_df)
    dtype_df.columns = ['col', 'type']
    return dtype_df[dtype_df.type == xtype].col.values


train_df = pd.read_excel(r'G:\test_onehot.xlsx')
# print('train_df', train_df)

# Get the names of the string (object) columns.
str_col = obtain_x(train_df, 'object')
print('str_col\n', str_col)
str_col_list = str_col.tolist()
print('str_list\n', str_col_list)
print('train_df[str_col_list]\n', train_df[str_col_list])

# Encode.
le = preprocessing.LabelEncoder()
# Collect the first two string columns; the name str_series avoids
# shadowing the built-in list.
str_series = []
str_series.append(train_df[str_col_list[0]])
str_series.append(train_df[str_col_list[1]])
print('str_series[1][0]\n', str_series[1][0])
le.fit(str_series[0])
print('le.transform(str_series[0])\n', le.transform(str_series[0]))
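The script above encodes only the first string column. One possible way to extend it to every string column is sketched below, with a separate LabelEncoder kept per column so that each can be inverted later. The DataFrame here is a hypothetical in-memory stand-in for the Excel sheet, the column names are made up, and the sketch assumes that every non-numeric column holds strings.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in for the spreadsheet; column names are illustrative.
df = pd.DataFrame({
    "city": ["paris", "tokyo", "paris", "amsterdam"],
    "size": ["S", "M", "L", "M"],
    "price": [1.0, 2.5, 3.0, 4.2],
})

# Assumption: all non-numeric columns are string columns.
str_cols = df.select_dtypes(exclude=["number"]).columns.tolist()

encoders = {}        # one fitted encoder per column, kept for inverse_transform
encoded = df.copy()  # leave the original frame untouched
for col in str_cols:
    le = LabelEncoder()
    encoded[col] = le.fit_transform(df[col])
    encoders[col] = le

print(encoded)
print({col: enc.classes_.tolist() for col, enc in encoders.items()})
```

Keeping one encoder per column matters because the integer codes are only meaningful relative to the classes_ of the encoder that produced them.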
2. Using pandas
2.1 One-hot encoding
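import pandas as pd

train_df = pd.read_excel(r'G:\test_onehot.xlsx')
# print('train_df', train_df)

# Get the names of the string (object) columns.
# obtain_x is the helper function defined in section 1.1.
str_col = obtain_x(train_df, 'object')

# One-hot encode the string columns with get_dummies.
train_df_dummy = pd.get_dummies(train_df[str_col])
# Replace the original string columns with their one-hot columns.
train_df = train_df.drop(str_col, axis=1)
train_df = train_df.join(train_df_dummy)
print('train_df\n', train_df)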
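To make the effect of get_dummies concrete, here is a small sketch on a made-up DataFrame (the column names and values are illustrative). It also uses the columns argument of pd.get_dummies, which encodes only the listed columns and drops the originals, so the separate drop and join steps are not needed in this variant.

```python
import pandas as pd

# Hypothetical toy data; not the contents of the spreadsheet above.
df = pd.DataFrame({
    "city": ["paris", "tokyo", "paris"],
    "price": [1.0, 2.5, 3.0],
})

# Encode only "city"; dtype=int yields 0/1 instead of booleans.
onehot = pd.get_dummies(df, columns=["city"], dtype=int)
print(onehot)
#    price  city_paris  city_tokyo
# 0    1.0           1           0
# 1    2.5           0           1
# 2    3.0           1           0
```

Each distinct string becomes its own 0/1 column named after the original column and the value, which avoids the artificial ordering that integer label codes imply.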