machine learning in coding(python): an introduction to the DataFrame data structure in the pandas package
Source: Internet  Editor: 程序博客网  Time: 2024/05/22 03:02
Importing modules:
import pandas as pd
import numpy as np  # pandas depends on numpy
from sklearn import preprocessing
import xgboost as xgb
Overview of commonly used features:
# load train and test
train = pd.read_csv('train.csv', index_col=0)  # index_col=0: use the first column as the row index
test = pd.read_csv('test.csv', index_col=0)

# type(train) is pandas.core.frame.DataFrame (dict-like: it maps column names to columns)
# train.head(n) returns the first n rows of train (n defaults to 5)
# train.head(0) returns an empty DataFrame that keeps only the column names;
# to get all rows, use train itself
# train.tail(n) returns the last n rows of train
# train.describe() returns summary statistics for train, for example:
'''
             Hazard         T1_V1         T1_V2         T1_V3        T1_V10
count  50999.000000  50999.000000  50999.000000  50999.000000  50999.000000
mean       4.022785      9.722093     12.847585      3.186004      7.020451
std        4.021194      5.167943      6.255743      1.739369      3.595279
min        1.000000      1.000000      1.000000      1.000000      2.000000
25%        1.000000      6.000000      7.000000      2.000000      3.000000
50%        3.000000      9.000000     14.000000      3.000000      8.000000
75%        5.000000     14.000000     18.000000      4.000000      8.000000
max       69.000000     19.000000     24.000000      9.000000     12.000000
'''
# train.T returns the transpose of train; train itself is unchanged,
# which can be verified by comparing train.shape with (train.T).shape
# train.shape returns (number of samples, number of features);
# the header row and the index column are not counted
# train.sort_values(by='Hazard') sorts by one column (train itself is unchanged), ascending by default;
# pass ascending=False to sort in descending order, or inplace=True to modify train itself
# (older pandas versions spelled this train.sort(columns='Hazard'), which has since been removed)

labels = train.Hazard  # the whole column named Hazard
train.drop('Hazard', axis=1, inplace=True)
# inplace=True deletes the column directly in train (train itself changes and has one column fewer)
# to keep the source data (train itself unchanged), use:
# new_train = train.drop('Hazard', axis=1, inplace=False)

# note: plain assignment does not copy, so train_s and test_s are the same objects as
# train and test, and the in-place drops below also remove these columns from train and test
train_s = train
test_s = test

train_s.drop('T2_V10', axis=1, inplace=True)
train_s.drop('T2_V7', axis=1, inplace=True)
train_s.drop('T1_V13', axis=1, inplace=True)
train_s.drop('T1_V10', axis=1, inplace=True)

test_s.drop('T2_V10', axis=1, inplace=True)
test_s.drop('T2_V7', axis=1, inplace=True)
test_s.drop('T1_V13', axis=1, inplace=True)
test_s.drop('T1_V10', axis=1, inplace=True)

columns = train.columns  # train.columns returns the column names
test_ind = test.index    # train.index / test.index returns the row index

# convert to numpy array before training
train_s = np.array(train_s)
test_s = np.array(test_s)

# label encode the categorical variables before training
for i in range(train_s.shape[1]):
    lbl = preprocessing.LabelEncoder()
    # fit on train and test together so that every value seen in either set gets a code
    lbl.fit(list(train_s[:, i]) + list(test_s[:, i]))
    train_s[:, i] = lbl.transform(train_s[:, i])
    test_s[:, i] = lbl.transform(test_s[:, i])

# convert to float before training using xgboost
train_s = train_s.astype(float)
test_s = test_s.astype(float)

# xgboost_pred is not defined in this post
preds1 = xgboost_pred(train_s, labels, test_s)

(code from kaggle)
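The label-encoding step in the listing above can be tried on a small example. This is only a minimal sketch: the two toy arrays are hypothetical stand-ins for np.array(train_s) and np.array(test_s), not the Kaggle data, and the column contents are made up for illustration.

```python
import numpy as np
from sklearn import preprocessing

# Hypothetical toy arrays standing in for np.array(train_s) and np.array(test_s);
# column 0 is numeric, column 1 is categorical
train_s = np.array([[15, "N"], [3, "B"], [16, "H"]], dtype=object)
test_s = np.array([[3, "K"], [15, "N"]], dtype=object)

for i in range(train_s.shape[1]):
    lbl = preprocessing.LabelEncoder()
    # Fit on the values from both arrays so that a value that occurs
    # only in test (here "K") still receives a code
    lbl.fit(list(train_s[:, i]) + list(test_s[:, i]))
    train_s[:, i] = lbl.transform(train_s[:, i])
    test_s[:, i] = lbl.transform(test_s[:, i])

# The arrays now hold integer codes; cast them to float as in the listing
train_s = train_s.astype(float)
test_s = test_s.astype(float)
```

LabelEncoder assigns codes in sorted order of the distinct values, so the numeric column is also replaced by ranks (3, 15, 16 become 0, 1, 2). Whether that is desirable for numeric features depends on the model; the listing applies it to every column.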
For an analysis of the xgboost_pred code, see: http://blog.csdn.net/mmc2015/article/details/47304779
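The DataFrame operations described in the comments of the listing can be checked without the Kaggle files. The sketch below assumes a hypothetical four-row frame in place of train.csv; the column names only imitate the style of the real data, and sort_values is used because it is the sorting method in current pandas.

```python
import pandas as pd

# Hypothetical toy data standing in for train.csv
train = pd.DataFrame(
    {
        "Hazard": [4, 1, 7, 1],
        "T1_V1": [15, 3, 16, 11],
        "T1_V4": ["N", "B", "H", "N"],
    },
    index=pd.Index([1, 2, 3, 5], name="Id"),
)

first_two = train.head(2)     # first 2 rows
last_one = train.tail(1)      # last row
empty = train.head(0)         # no rows, column names only
n_rows, n_cols = train.shape  # header row and index column are not counted

transposed = train.T          # a new object; train keeps its own shape

# sort_values returns a sorted copy unless inplace=True is passed
by_hazard_desc = train.sort_values(by="Hazard", ascending=False)

# drop without inplace returns a new DataFrame and leaves train intact
features = train.drop("Hazard", axis=1)

# Plain assignment only creates a second name for the same object;
# .copy() is needed to get an independent DataFrame
alias = train
independent = train.copy()
alias.drop("T1_V1", axis=1, inplace=True)  # this also changes train
```

After the last line, train has lost the T1_V1 column as well, while independent and features still contain it. This is the same aliasing that occurs with train_s = train in the listing.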