Kaggle | Santander Product Recommendation比赛总结
来源:互联网 发布:centos etc 文件夹 编辑:程序博客网 时间:2024/06/16 06:18
Santander Product Recommendation应该是我参加kaggle的第一场比赛,虽然这场比赛最终只取得了103/1704(TOP 6%)的成绩,不过这场比赛让我熟悉了kaggle这个平台,结交了一些志同道合的kaggler。在持续的两个月的过程中,我进行了很多的尝试,学到了很多,下面对这场比赛进行个小结。
题目:根据西班牙Santander银行前17个月的商品购买记录以及用户属性去预测2016年6月每个用户最可能购买的商品(给出预测的7个商品,按可能性大小排列,评测指标用MAP@7)。
降低电脑使用内存:这题的数据量很大,在两个月的比赛过程中,本博主的电脑死机不下50次。本博主参加kaggle这题后,买了根8g的内存条把自己的电脑内存扩充到14g,可是还是经常会死机(程序写的太复杂或者应用软件开的过多)。处理数据如果读入一行处理一行,内存的占用率不会太大,可是这样对数据进行分组聚合和一些统计的处理会显得很麻烦。所以由于本博主处理能力有限,每次都是将整个表读入内存再进行处理,虽然很占内存,可是预处理提取特征这些操作会相对容易实现。这里用pandas的read_csv把2g的数据读到内存会占用10g的内存,因此我们读数据时可以通过改变数据读入的类型(将float64设置成int8读入)或者只读入需要的列,来减少内存的使用。另外,在编程过程中应及时删除一些不用的表来释放内存。
缺失值处理:这里只想说我在比赛过程中遇到的一个坑,csv中缺失的数据(为空)在采用不同的读入方式会呈现不同的结果。采用csv.DictReader( )会将csv中的缺失值读为' ',而用pd.read_csv( )会将缺失值读为nan。剩下的在缺失值的处理中要注意的就是数据类型的转换。
模型:这题发布的第一个月,各位kaggler尝试了各种方法模型,效果都不是很好。后来在BreakfastPirate发布了一条论坛介绍了自己的思路后(由于6月是商品销售分布和其它月差别很大,因此使用xgboost多分类用训练集里的6月去预测测试集6月,这样也大大减小了训练集),比赛排行榜上各种上升,各位kaggler都在此思路上进行特征工程来提高自己的分数。可能是本博主水平略差,我将自己能想到的特征都尝试了一遍可是分数依然提高不多,在比赛最后7天,我没有任何突破,名次掉了30多,哎,真的不会啊,每次加完特征都没提升,有时还会下降,我特么都开始怀疑人生了。此外,这题ensemble没什么用。
特征工程:lag-5特征,renta等属性分组购买率,各种sum,一些商品组合的求和及编码。
线下验证:本博主参加了两次kaggle了,每次比赛结束由Public LeaderBoard转到Private LeaderBoard时,名次都会相对下降一些,一方面可能是因为模型的鲁棒性不是很好,还有一点原因可能是因为过多依赖线上结果。因此,我们比赛时应多参考线下结果,不能一味的依赖于线上结果,这样可能会过拟合Public LeaderBoard。
其它比赛思路:Breakfast Pirate
Ps:已经参加完两场kaggle啦,接下来继续参加,继续加油,哈哈哈哈!!!
我的比赛核心代码如下:
''' author:TaoZI date:2016/12/22'''import datetimeimport numpy as npimport pandas as pdimport xgboost as xgbfrom sklearn.cross_validation import KFoldpd.options.mode.chained_assignment = Nonemapping_dict = {'sexo' : {'nan':0,'H':0, 'V':1},'ind_actividad_cliente' : {'nan':0, '0.0':0, '0':0,'1.0':1, '1':1},'segmento' : {'nan':0, '01 - TOP':0, '03 - UNIVERSITARIO':1, '02 - PARTICULARES':2},'ind_nuevo' : {'nan':0, '1.0':0, '1':0, '0.0':1, '0':1 },'tiprel_1mes' : {'nan':0, 'P':0, 'R':0, 'N':0, 'I':1, 'A':2},'indext' : {'nan':0, 'S':0, 'N':1}}target_raw_cols = ['ind_ahor_fin_ult1', 'ind_aval_fin_ult1', 'ind_cco_fin_ult1', 'ind_cder_fin_ult1', 'ind_cno_fin_ult1', 'ind_ctju_fin_ult1', 'ind_ctma_fin_ult1', 'ind_ctop_fin_ult1', 'ind_ctpp_fin_ult1', 'ind_deco_fin_ult1', 'ind_deme_fin_ult1', 'ind_dela_fin_ult1', 'ind_ecue_fin_ult1', 'ind_fond_fin_ult1', 'ind_hip_fin_ult1', 'ind_plan_fin_ult1', 'ind_pres_fin_ult1', 'ind_reca_fin_ult1', 'ind_tjcr_fin_ult1', 'ind_valo_fin_ult1', 'ind_viv_fin_ult1', 'ind_nomina_ult1', 'ind_nom_pens_ult1', 'ind_recibo_ult1']target_cols = target_raw_cols[2:]con_cols = ['ncodpers', 'fecha_dato', 'age', 'antiguedad','renta']cat_cols = mapping_dict.keys()user_cols = con_cols + cat_cols + target_raw_colsNUM_CLASS = 22def getAge(str_age): age = str_age.strip() if age == 'NA' or age == 'nan': age1 = 2 elif float(age) < 20: age1 = 0 elif float(age) < 30: age1 = 1 elif float(age) < 40: age1 = 2 elif float(age) < 50: age1 = 3 elif float(age) < 60: age1 = 4 else: age1 = 5 return age1def getCustSeniority(str_seniority): cust_seniority = str_seniority.strip() if cust_seniority == 'NA' or cust_seniority == 'nan': seniority = 4 elif float(cust_seniority) < 50: seniority = 0 elif float(cust_seniority) < 75: seniority = 1 elif float(cust_seniority) < 100: seniority = 2 elif float(cust_seniority) < 125: seniority = 3 elif float(cust_seniority) < 150: seniority = 4 elif float(cust_seniority) < 175: seniority = 5 elif float(cust_seniority) < 200: seniority = 6 elif float(cust_seniority) < 225: seniority = 7 else: seniority = 8 return senioritydef getRent(str_rent): rent = str_rent.strip() if rent == 'NA' or rent == 'nan': rent1 = 4 elif float(rent) < 45542.97: rent1 = 1 elif float(rent) < 57629.67: rent1 = 2 elif float(rent) < 68211.78: rent1 = 3 elif float(rent) < 78852.39: rent1 = 4 elif float(rent) < 90461.97: rent1 = 5 elif float(rent) < 103855.23: rent1 = 6 elif float(rent) < 120063.00: rent1 = 7 elif float(rent) < 141347.49: rent1 = 8 elif float(rent) < 173418.12: rent1 = 9 elif float(rent) < 234687.12: rent1 = 10 else: rent1 = 11 return rent1def add_com_features(lag_feats): lag_feats['prod_sum'] = lag_feats.apply(lambda x: np.sum(x[-120:]), axis = 1) for i, pre in enumerate(['1_', '2_', '3_', '4_', '5_']): pre_cols = [pre + col for col in target_raw_cols] lag_feats['sum_24_' + str(i + 1)] = lag_feats.loc[:, pre_cols].sum(axis = 1) sum_24_list = ['sum_24_' + str(i + 1) for i in range(5)] lag_feats['sum_24_max'] = lag_feats[sum_24_list].max(axis = 1) lag_feats['sum_24_min'] = lag_feats[sum_24_list].min(axis = 1) lag_feats['sum_24_mean'] = lag_feats[sum_24_list].mean(axis = 1) for i, col in enumerate(target_raw_cols): index_list = [pre + col for pre in ['1_', '2_', '3_', '4_', '5_']] lag_feats['prod_sum_' + str(i)] = lag_feats.loc[:, index_list].sum(axis = 1) pro_sum_list = ['prod_sum_' + str(i) for i in range(24)] for gp_col in ['renta', 'sexo']: group_feats = lag_feats[pro_sum_list].groupby(lag_feats[gp_col]).agg(lambda x: round(x.sum() / x.count(), 2)) group_feats.columns = [gp_col + str(i) for i in range(24)] lag_feats = pd.merge(lag_feats, group_feats, left_on = gp_col, right_index = True, how = 'left') com_col = [[0, 2], [7, 8, 9], [9, 10, 11], [19, 20, 21]] for x in range(4): import_col = [target_cols[i] for i in com_col[x]] for i in range(1, 6): pre_import_col = [str(i) + '_' + col for col in import_col] lag_feats[str(i) + '_' + str(x + 1) + '_s_sum_import'] = lag_feats[pre_import_col].sum(axis = 1) return lag_featsdef process_train_data(in_file_name, date_list): this_month = in_file_name[in_file_name['fecha_dato'].isin([date_list[0]])] for col in cat_cols: this_month[col] = this_month[col].apply(lambda x:mapping_dict[col][str(x)]) for col in target_raw_cols: this_month[col].fillna(0, inplace=True) this_month['age'] = this_month['age'].apply(lambda x: getAge(x)) this_month['antiguedad'] = this_month['antiguedad'].apply(lambda x: getCustSeniority(x)) this_month['renta'] = this_month['renta'].apply(lambda x: getRent(str(x))) hist_data = in_file_name.loc[:,['ncodpers','fecha_dato'] + target_raw_cols] del in_file_name pre_month = hist_data[hist_data['fecha_dato'].isin([date_list[1]])] pre_month_ncodpers = pre_month[['ncodpers']] pre_month_target = pre_month[target_raw_cols] pre_month_target = pre_month_target.add_prefix('1_') pre_month = pd.concat([pre_month_ncodpers, pre_month_target], axis=1) this_month = pd.merge(this_month, pre_month, on=['ncodpers'], how='left') this_month.fillna(0, inplace=True) for col in target_cols: this_month[col] = np.where(this_month[col]-this_month['1_'+col] > 0,(this_month[col]-this_month['1_'+col]), 0 ) this_month_target = this_month[target_cols] this_month = this_month.drop(target_raw_cols, axis=1) x_vars_list = [] y_vars_list = [] for i in range(2, len(date_list)): tmp = hist_data[hist_data['fecha_dato'].isin([date_list[i]])].loc[:,['ncodpers'] + target_raw_cols] tmp = tmp.add_prefix(str(i) + "_") tmp.rename(columns={str(i) + '_ncodpers': 'ncodpers'}, inplace=True) this_month = pd.merge(this_month, tmp, on=['ncodpers'], how='left') this_month.fillna(0, inplace=True) del hist_data this_month = add_com_features(this_month) this_month.fillna(0, inplace=True) this_month = pd.concat([this_month,this_month_target],axis=1) for idx,row in this_month.iterrows(): for i in range(0,22): if row[(-22+i)]>0: x_vars_list.append(row[:-22]) y_vars_list.append(i) return np.array(x_vars_list), np.array(y_vars_list)def process_test_data(test_file, hist_file, date_list): for col in cat_cols: test_file[col] = test_file[col].apply(lambda x: mapping_dict[col][str(x)]) test_file['age'] = test_file['age'].apply(lambda x: getAge(x)) test_file['antiguedad'] = test_file['antiguedad'].apply(lambda x: getCustSeniority(x)) test_file['renta'] = test_file['renta'].apply(lambda x: getRent(x)) for i in range(0, len(date_list)): tmp = hist_file[hist_file['fecha_dato'].isin([date_list[i]])].loc[:,['ncodpers'] + target_raw_cols] tmp = tmp.add_prefix(str(i + 1) + "_") tmp.rename(columns={str(i + 1) + '_ncodpers': 'ncodpers'}, inplace=True) test_file = pd.merge(test_file, tmp, on=['ncodpers'], how='left') test_file.fillna(0, inplace=True) del hist_file test_file = add_com_features(test_file) test_file.fillna(0, inplace=True) return test_file.valuesdef runXGB_CV(train_X,train_y, test_X, index, seed_val): train_index, test_index = index X_train = train_X[train_index] y_train = train_y[train_index] xgtrain = xgb.DMatrix(X_train, label=y_train) xgtest = xgb.DMatrix(test_X) param = { 'objective' : 'multi:softprob', 'eval_metric' : "mlogloss", 'num_class' : NUM_CLASS, 'silent' : 1, 'min_child_weight' : 2, 'eta': 0.05, 'max_depth': 6, 'subsample' : 0.9, 'colsample_bytree' : 0.8, 'seed' : seed_val } num_rounds = 100 model = xgb.train(param, xgtrain, num_rounds) pred = model.predict(xgtest) return preddef runXGB(train_X, train_y, test_X,seed_val=123): param = { 'objective' : 'multi:softprob', 'eval_metric' : "mlogloss", 'num_class' : NUM_CLASS, 'silent' : 1, 'min_child_weight' : 2, 'eta': 0.05, 'max_depth': 6, 'subsample' : 0.9, 'colsample_bytree' : 0.8, 'seed' : seed_val } num_rounds = 100 xgtrain = xgb.DMatrix(train_X, label = train_y) xgtest = xgb.DMatrix(test_X) model = xgb.train(param, xgtrain, num_rounds) preds = model.predict(xgtest) return predsif __name__ == "__main__": cv_sel = 1 start_time = datetime.datetime.now() data_path = '../input/' print "feature extract..." train_file = pd.read_csv(data_path + 'train_ver3.csv', dtype = {'age': 'str', 'antiguedad': 'str', 'renta': 'str'}, usecols = user_cols) print datetime.datetime.now() - start_time train_X, train_y = process_train_data(train_file, ['2015-06-28', '2015-05-28', '2015-04-28', '2015-03-28', '2015-02-28', '2015-01-28']) train_X = train_X[:, 2:] print datetime.datetime.now() - start_time data_date = ['2016-05-28', '2016-04-28', '2016-03-28', '2016-02-28', '2016-01-28'] train_file = train_file[train_file['fecha_dato'].isin(data_date)].loc[:, ['ncodpers', 'fecha_dato'] + target_raw_cols] test_file = pd.read_csv(data_path + 'test_ver3.csv', dtype = {'age': 'str', 'antiguedad': 'str', 'renta': 'str'}, usecols = con_cols + cat_cols) test_X = process_test_data(test_file, train_file, data_date) print datetime.datetime.now() - start_time del train_file, test_file test_X = test_X[:, 2:] print train_X.shape, train_y.shape, test_X.shape print datetime.datetime.now() - start_time seed_val = 123 if cv_sel == 1: print "running model with cv..." nfolds = 5 kf = KFold(train_X.shape[0], n_folds = nfolds, shuffle = True, random_state = seed_val) preds = [0] * NUM_CLASS for i, index in enumerate(kf): preds += runXGB_CV(train_X, train_y, test_X, index, seed_val) print 'fold %d' % (i + 1) preds = preds / nfolds else: print "running model with feature..." preds = runXGB(train_X, train_y, test_X,seed_val) del train_X, test_X, train_y print "Getting the top products.." target_cols = np.array(target_cols) preds = np.argsort(preds, axis = 1) preds = np.fliplr(preds)[:, :7] test_id = np.array(pd.read_csv( data_path + 'test_ver2.csv', usecols = ['ncodpers'])['ncodpers']) final_preds = [" ".join(list(target_cols[pred])) for pred in preds] out_df = pd.DataFrame({'ncodpers': test_id, 'added_products': final_preds}) out_df.to_csv('../submit/sub_xgb.csv', index = False)
- Kaggle | Santander Product Recommendation比赛总结
- 一次失败的kaggle比赛Santander Customer Satisfaction:赛题简介与初次尝试
- Kaggle竞赛Santander 客户满意
- 新手kaggle比赛总结之一
- Kaggle | Allstate Claims Severity比赛总结
- kaggle比赛
- Kaggle比赛-Otto Group Product Classification-打败一半参赛队伍的简单解法
- state-farm-distracted-driver-detection (Kaggle 比赛总结)
- kaggle比赛集成指南
- Kaggle 比赛入门
- kaggle比赛集成指南
- Kaggle比赛知识简要
- Kaggle比赛整理
- Kaggle[4] - Online Product Sales
- 推荐算法总结Recommendation
- python-kaggle比赛资料收集
- kaggle--mnist--knn 比赛入门
- kaggle数据挖掘比赛经验
- shell 脚本kill oracle 数据库指定用户
- 利用本地存储实现记录滚动条的位置
- 欢迎使用CSDN-markdown编辑器
- c语言==笔试题6(25)
- [FineReport]重复与冻结设置
- Kaggle | Santander Product Recommendation比赛总结
- 恩布开源安卓手机IM,EntboostIM发布1.5.1版本
- 第16周OJ-1
- Linux开源系统对比Windows闭源系统的优势解析
- 《SpannableStringBuilder的使用》
- Linux more 命令用法详解
- 欢迎使用CSDN-markdown编辑器
- 文章标题
- 第17周自由入侵 输出班级第n个学生的平均成绩