Machine learning in coding (Python): an introduction to the pandas DataFrame data structure





Import the modules:

import pandas as pd
import numpy as np  # pandas depends on numpy
from sklearn import preprocessing
import xgboost as xgb



Overview of common operations:

# Load train and test
train = pd.read_csv('train.csv', index_col=0)  # index_col=0: use the first column as the index
test = pd.read_csv('test.csv', index_col=0)

# type(train) is pandas.core.frame.DataFrame (a dict-like container of columns)
# train.head(n): the first n rows of train (n defaults to 5)
# train.head(0): an empty frame that keeps only the column names, not the whole data
# train.tail(n): the last n rows of train
# train.describe(): summary statistics for train, for example:
'''
             Hazard         T1_V1         T1_V2         T1_V3        T1_V10
count  50999.000000  50999.000000  50999.000000  50999.000000  50999.000000
mean       4.022785      9.722093     12.847585      3.186004      7.020451
std        4.021194      5.167943      6.255743      1.739369      3.595279
min        1.000000      1.000000      1.000000      1.000000      2.000000
25%        1.000000      6.000000      7.000000      2.000000      3.000000
50%        3.000000      9.000000     14.000000      3.000000      8.000000
75%        5.000000     14.000000     18.000000      4.000000      8.000000
max       69.000000     19.000000     24.000000      9.000000     12.000000
'''
# train.T: the transpose of train (train itself is unchanged);
# this can be verified with train.shape and (train.T).shape
# train.shape: (number of samples, number of features);
# the column names and the row index are not counted
# train.sort_values(by='Hazard'): sort by a column (train itself is unchanged), ascending by default;
# pass ascending=False for descending order, or inplace=True to modify train itself
# (the older train.sort(columns='Hazard') has been removed from current pandas)

labels = train.Hazard  # the whole column named Hazard
train.drop('Hazard', axis=1, inplace=True)
# inplace=True deletes directly from the source data (train itself changes and loses that column)
# To keep the source data unchanged, use:
# new_train = train.drop('Hazard', axis=1, inplace=False)

train_s = train  # note: an alias, not a copy, so the drops below also change train
test_s = test    # likewise an alias of test

train_s.drop('T2_V10', axis=1, inplace=True)
train_s.drop('T2_V7', axis=1, inplace=True)
train_s.drop('T1_V13', axis=1, inplace=True)
train_s.drop('T1_V10', axis=1, inplace=True)

test_s.drop('T2_V10', axis=1, inplace=True)
test_s.drop('T2_V7', axis=1, inplace=True)
test_s.drop('T1_V13', axis=1, inplace=True)
test_s.drop('T1_V10', axis=1, inplace=True)

columns = train.columns
test_ind = test.index
# train.columns: the column names
# train.index: the row index

# Convert to numpy arrays before training
train_s = np.array(train_s)
test_s = np.array(test_s)

# Label-encode the categorical variables before training
for i in range(train_s.shape[1]):
    lbl = preprocessing.LabelEncoder()
    lbl.fit(list(train_s[:, i]) + list(test_s[:, i]))
    train_s[:, i] = lbl.transform(train_s[:, i])
    test_s[:, i] = lbl.transform(test_s[:, i])

# Convert to float before training with xgboost
train_s = train_s.astype(float)
test_s = test_s.astype(float)

# xgboost_pred is not defined in this post; see the link below
preds1 = xgboost_pred(train_s, labels, test_s)
(code from kaggle)
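The difference between operations that return a new DataFrame and those that modify it in place is easy to get wrong, so here is a minimal sketch on a toy frame. The column names follow the article, but the values and index labels are made up for illustration. It shows what head(0) returns, that sort_values leaves the original untouched, and that plain assignment (as in train_s = train) creates an alias rather than a copy.

```python
import pandas as pd

# Toy frame: column names mirror the article, the values are invented.
train = pd.DataFrame(
    {"Hazard": [3, 1, 5], "T1_V1": [9, 6, 14], "T1_V2": [12, 7, 18]},
    index=[101, 102, 103],
)

# head(0) keeps the columns but returns no rows.
empty = train.head(0)

# sort_values returns a new frame unless inplace=True is passed.
by_hazard = train.sort_values(by="Hazard", ascending=False)

# Plain assignment only creates a second name for the same object ...
alias = train
# ... whereas copy() gives an independent frame.
independent = train.copy()

# Dropping through the alias also removes the column from train.
alias.drop("T1_V2", axis=1, inplace=True)

print(empty.shape)                # (0, 3)
print(list(by_hazard.index))      # [103, 101, 102]
print(list(train.columns))        # ['Hazard', 'T1_V1']
print(list(independent.columns))  # ['Hazard', 'T1_V1', 'T1_V2']
```

If the original train needs to stay intact, train_s = train.copy() is one way to get an independent frame before dropping columns.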



For an analysis of the xgboost_pred code, see: http://blog.csdn.net/mmc2015/article/details/47304779
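As a side note on the label-encoding loop in the code above: each LabelEncoder is fitted on the train and test values of a column combined, so a category that appears only in the test set still receives a code. The sketch below reproduces that loop on two small made-up arrays (the category values are assumptions chosen for illustration, not data from the competition).

```python
import numpy as np
from sklearn import preprocessing

# Made-up categorical columns with object dtype, similar to what
# np.array(DataFrame) yields for mixed-type data.
train_s = np.array([["A", "x"], ["B", "y"], ["A", "y"]], dtype=object)
test_s = np.array([["C", "x"], ["A", "z"]], dtype=object)

for i in range(train_s.shape[1]):
    lbl = preprocessing.LabelEncoder()
    # Fit on train + test so categories seen only in test ("C", "z") get a code too.
    lbl.fit(list(train_s[:, i]) + list(test_s[:, i]))
    train_s[:, i] = lbl.transform(train_s[:, i])
    test_s[:, i] = lbl.transform(test_s[:, i])

# The encoded object arrays can now be cast to float.
train_s = train_s.astype(float)
test_s = test_s.astype(float)

print(train_s.tolist())  # [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
print(test_s.tolist())   # [[2.0, 0.0], [0.0, 2.0]]
```

Fitting on the training column alone would make transform raise an error on the unseen test values, which is presumably why the Kaggle code concatenates both lists.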



