Machine Learning Project Writeup -- Display Advertising Challenge
In July 2014, CriteoLabs launched a Kaggle competition on display-ad click-through-rate prediction. First place went to a three-person team from Taiwan calling themselves "3 Idiots". I recently studied their open-sourced winning code and share their approach here. The code is well worth studying for machine learning beginners, especially for people moving into the field without prior project experience: everything from data preprocessing to model selection is laid out in detail. After reading it, you will have a feel for how machine learning is applied in industry.
Below we walk through the whole pipeline step by step, starting from the dataset, with the relevant code alongside.
1–Dataset
Label - 1 means the ad was clicked, 0 means it was not.
I1-I13 - 13 columns of numerical features.
C1-C26 - categorical features; their values are hashed, so the original meanings are hidden.
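For a quick sanity check, one can peek at a raw row like this (a minimal sketch of my own; the file name tr.csv and the column names follow the scripts and the HEADER constant in common.py below):

import csv

# Print the label, one numeric field, and one categorical field of the first row
with open('tr.csv') as f:
    row = next(csv.DictReader(f))
    print(row['Label'], row['I1'], row['C1'])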
2–Data Cleaning and Preprocessing
2.1- Count, over the training data, how many times each categorical feature value appears, and record the values that appear more than ten times (the threshold is configurable). The command executed is:
cmd = './utils/count.py tr.csv > fc.trva.t10.txt'
The content of count.py:
# count.py: count how many times each categorical feature value appears
import argparse, csv, sys, collections
from common import *

if len(sys.argv) == 1:
    sys.argv.append('-h')

parser = argparse.ArgumentParser()
parser.add_argument('csv_path', type=str)
args = vars(parser.parse_args())

# defaultdict: the lambda supplies [neg, pos, total] = [0, 0, 0] for keys not seen yet
counts = collections.defaultdict(lambda: [0, 0, 0])

# start=1 makes the row counter i begin at 1
for i, row in enumerate(csv.DictReader(open(args['csv_path'])), start=1):
    label = row['Label']
    for j in range(1, 27):
        field = 'C{0}'.format(j)
        value = row[field]
        if label == '0':
            counts[field+','+value][0] += 1
        else:
            counts[field+','+value][1] += 1
        counts[field+','+value][2] += 1
    if i % 1000000 == 0:
        sys.stderr.write('{0}m\n'.format(int(i/1000000)))

print('Field,Value,Neg,Pos,Total,Ratio')
# counts.items() yields (key, [neg, pos, total]) pairs; sort by total count
for key, (neg, pos, total) in sorted(counts.items(), key=lambda x: x[1][2]):
    if total < 10:
        continue
    ratio = round(float(pos)/total, 5)
    print(key+','+str(neg)+','+str(pos)+','+str(total)+','+str(ratio))
After the script finishes, fc.trva.t10.txt holds the counts:
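From count.py's print statements, each output line has the form Field,Value,Neg,Pos,Total,Ratio. For illustration, a couple of hypothetical rows (the counts below are invented, not taken from the real data):

Field,Value,Neg,Pos,Total,Ratio
C9,a73ee510,118,37,155,0.23871
C6,7e0ccccf,412,96,508,0.18898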
2.2- Generate two files from the training set, one for the numerical features (I1-I13) and one for the categorical features (C1-C26); these are the inputs to the GBDT step. The conversion is parallelized across multiple worker processes. The numerical features are written densely: each line records the label and the 13 values, with missing values filled in as -10 (most likely an out-of-range sentinel that lets the trees isolate missing values in a split; the authors never explain the exact number). The categorical features are one-hot encoded, keeping only the values that occur extremely often (per the comment in pre-a.py, more than 4 million times; the authors apparently counted these beforehand and hard-coded the resulting whitelist instead of computing it in the code). The command executed is:
cmd = 'converters/parallelizer-a.py -s {nr_thread} converters/pre-a.py tr.csv tr.gbdt.dense tr.gbdt.sparse'.format(nr_thread=NR_THREAD)
# parallelizer-a.py
import argparse, sys
from common import *

def parse_args():
    if len(sys.argv) == 1:
        sys.argv.append('-h')
    parser = argparse.ArgumentParser()
    parser.add_argument('-s', dest='nr_thread', default=12, type=int)
    parser.add_argument('cvt_path')
    parser.add_argument('src_path')
    parser.add_argument('dst1_path')
    parser.add_argument('dst2_path')
    args = vars(parser.parse_args())
    return args

def main():
    args = parse_args()
    nr_thread = args['nr_thread']
    # split the source file into one chunk per worker
    split(args['src_path'], nr_thread, True)
    # run pre-a.py on every chunk to produce the dense/sparse GBDT inputs
    parallel_convert(args['cvt_path'], [args['src_path'], args['dst1_path'], args['dst2_path']], nr_thread)
    # merge the per-chunk outputs, then clean up the temporary files
    cat(args['dst1_path'], nr_thread)
    cat(args['dst2_path'], nr_thread)
    delete(args['src_path'], nr_thread)
    delete(args['dst1_path'], nr_thread)
    delete(args['dst2_path'], nr_thread)

main()
# pre-a.py: generate the dense (numeric) and sparse (categorical) inputs for GBDT
import argparse, csv, sys
from common import *

if len(sys.argv) == 1:
    sys.argv.append('-h')

parser = argparse.ArgumentParser()
parser.add_argument('csv_path', type=str)
parser.add_argument('dense_path', type=str)
parser.add_argument('sparse_path', type=str)
args = vars(parser.parse_args())

# These features are dense enough (they appear in the dataset more than 4 million times),
# so we include them in GBDT
target_cat_feats = ['C9-a73ee510', 'C22-', 'C17-e5ba7672', 'C26-', 'C23-32c7478e',
    'C6-7e0ccccf', 'C14-b28479f6', 'C19-21ddcdc9', 'C14-07d13a8f', 'C10-3b08e48b',
    'C6-fbad5c96', 'C23-3a171ecb', 'C20-b1252a9d', 'C20-5840adea', 'C6-fe6b92e5',
    'C20-a458ea53', 'C14-1adce6ef', 'C25-001f3601', 'C22-ad3062eb', 'C17-07c540c4',
    'C6-', 'C23-423fab69', 'C17-d4bb7bd8', 'C2-38a947a1', 'C25-e8b83407', 'C9-7cc72ec2']

with open(args['dense_path'], 'w') as f_d, open(args['sparse_path'], 'w') as f_s:
    for row in csv.DictReader(open(args['csv_path'])):
        # numeric features: one dense line per row, label first
        feats = []
        for j in range(1, 14):
            val = row['I{0}'.format(j)]
            if val == '':
                # -10 is an out-of-range sentinel for missing values, so the trees
                # can split them off; the authors never justify the exact number
                val = -10
            feats.append('{0}'.format(val))
        f_d.write(row['Label'] + ' ' + ' '.join(feats) + '\n')

        # categorical features: emit the indices of the whitelisted values present in this row
        cat_feats = set()
        for j in range(1, 27):
            field = 'C{0}'.format(j)
            key = field + '-' + row[field]
            cat_feats.add(key)
        feats = []
        for j, feat in enumerate(target_cat_feats, start=1):
            if feat in cat_feats:
                feats.append(str(j))
        f_s.write(row['Label'] + ' ' + ' '.join(feats) + '\n')
These scripts rely on common.py, a shared utility module also used by the later steps, so it is listed here once:
import hashlib, csv, math, os, pickle, subprocess

HEADER = "Id,Label,I1,I2,I3,I4,I5,I6,I7,I8,I9,I10,I11,I12,I13,C1,C2,C3,C4,C5,C6,C7,C8,C9,C10,C11,C12,C13,C14,C15,C16,C17,C18,C19,C20,C21,C22,C23,C24,C25,C26"

def open_with_first_line_skipped(path, skip=True):
    f = open(path)
    if not skip:
        return f
    next(f)  # advance past the header line
    return f

# Hash a feature string into a bucket in [1, nr_bins-1] via its md5 digest
def hashstr(str, nr_bins):
    return int(hashlib.md5(str.encode('utf8')).hexdigest(), 16)%(nr_bins-1)+1

# Turn a raw CSV row into 39 'field-value' strings, e.g.
# ['I1-SP1', 'I2-SP1', 'I3-2', ..., 'C1-68fd1e64', ..., 'C26-9727dd16']
def gen_feats(row):
    feats = []
    for j in range(1, 14):
        field = 'I{0}'.format(j)
        value = row[field]
        if value != '':
            value = int(value)
            if value > 2:
                # numeric values above 2 are bucketed as int(ln(v))**2
                value = int(math.log(float(value))**2)
            else:
                value = 'SP'+str(value)
        key = field + '-' + str(value)
        feats.append(key)
    for j in range(1, 27):
        field = 'C{0}'.format(j)
        value = row[field]
        key = field + '-' + value
        feats.append(key)
    return feats

# Load the set of feature values that count.py saw at least `threshold` times
def read_freqent_feats(threshold=10):
    frequent_feats = set()
    for row in csv.DictReader(open('fc.trva.t10.txt')):
        if int(row['Total']) < threshold:
            continue
        frequent_feats.add(row['Field']+'-'+row['Value'])
    return frequent_feats

# Split a file into nr_thread chunks, one per worker process
def split(path, nr_thread, has_header):
    def open_with_header_written(path, idx, header):
        f = open(path+'.__tmp__.{0}'.format(idx), 'w')
        if not has_header:
            return f
        f.write(header)
        return f

    def calc_nr_lines_per_thread():
        # `wc -l` counts the lines in the file
        nr_lines = int(list(subprocess.Popen('wc -l {0}'.format(path), shell=True,
            stdout=subprocess.PIPE).stdout)[0].split()[0])
        if not has_header:
            nr_lines += 1
        return math.ceil(float(nr_lines)/nr_thread)

    header = open(path).readline()  # keep the header so every chunk gets a copy
    nr_lines_per_thread = calc_nr_lines_per_thread()
    idx = 0
    f = open_with_header_written(path, idx, header)
    for i, line in enumerate(open_with_first_line_skipped(path, has_header), start=1):
        if i % nr_lines_per_thread == 0:
            f.close()
            idx += 1
            f = open_with_header_written(path, idx, header)
        f.write(line)
    f.close()

# Run the converter script (e.g. pre-a.py) on every chunk in parallel
def parallel_convert(cvt_path, arg_paths, nr_thread):
    workers = []
    for i in range(nr_thread):
        cmd = '{0}'.format(os.path.join('.', cvt_path))
        for path in arg_paths:
            cmd += ' {0}'.format(path+'.__tmp__.{0}'.format(i))
        worker = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE)
        workers.append(worker)
    for worker in workers:
        worker.communicate()

# Concatenate the per-chunk outputs back into a single file
def cat(path, nr_thread):
    if os.path.exists(path):
        os.remove(path)
    for i in range(nr_thread):
        cmd = 'cat {svm}.__tmp__.{idx} >> {svm}'.format(svm=path, idx=i)
        p = subprocess.Popen(cmd, shell=True)
        p.communicate()

# Remove the temporary chunk files
def delete(path, nr_thread):
    for i in range(nr_thread):
        os.remove('{0}.__tmp__.{1}'.format(path, i))
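The most distinctive transform in common.py is the log-squared bucketing of numeric values in gen_feats. A minimal standalone check (my own throwaway script, not part of the repo):

import math

# Values <= 2 keep a literal 'SP' prefix; larger values are compressed to int(ln(v))**2,
# which maps the heavy-tailed counts onto a small number of buckets
for v in [0, 1, 2, 3, 10, 100, 10000]:
    print(v, '->', 'SP{0}'.format(v) if v <= 2 else int(math.log(float(v))**2))
# 0 -> SP0, 1 -> SP1, 2 -> SP2, 3 -> 1, 10 -> 5, 100 -> 21, 10000 -> 84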
After this step we obtain the two files tr.gbdt.dense and tr.gbdt.sparse; a sample of the data makes the formats easier to grasp.
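Since the write format is right there in pre-a.py, the shape of the two files is easy to reconstruct (the rows below are invented for illustration, not real data):

0 1 5 -10 2 52 0 7 -10 4 30 -10 1 9   (tr.gbdt.dense: label, then I1-I13; -10 marks missing)
0 1 7 12 24                           (tr.gbdt.sparse: label, then the indices, 1-26, of the whitelist values present)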
2.3- Use the GBDT algorithm to expand the feature set. 30 CART trees of depth 7 are grown, and the index of the leaf each impression falls into becomes a new categorical feature, one per tree; the feature space therefore gains 30 features with up to 2^7 = 128 values each.
cmd = './gbdt -t 30 -s {nr_thread} te.gbdt.dense te.gbdt.sparse tr.gbdt.dense tr.gbdt.sparse te.gbdt.out tr.gbdt.out'.format(nr_thread=NR_THREAD)
Part of the result after running it: each line of tr.gbdt.out starts with a token that pre-b.py later skips (presumably the label), followed by 30 leaf indices, one per tree.
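The ./gbdt binary is the authors' own C++ implementation. To illustrate the same idea, leaf indices as new categorical features, here is a sketch using scikit-learn; the random matrices are stand-ins, not the Criteo data:

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

X = np.random.rand(1000, 13)        # stand-in for the 13 dense numeric features
y = np.random.randint(0, 2, 1000)   # stand-in for the click labels

gbdt = GradientBoostingClassifier(n_estimators=30, max_depth=7).fit(X, y)

# apply() reports, for every sample, the leaf it reaches in each tree;
# the shape is (n_samples, n_estimators, 1) for binary classification
leaves = gbdt.apply(X)[:, :, 0].astype(int)
print(leaves.shape)  # (1000, 30): 30 new categorical features per impression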
2.4- Generate the FFM features. Each impression's 13 (numerical) + 26 (categorical) + 30 (GBDT) = 69 features are converted into the format FFM understands. When encoding the features, the authors do not build dictionaries by hand; instead a hash function assigns every feature string its index, as is plain to see in the code. The command executed is:
cmd = 'converters/parallelizer-b.py -s {nr_thread} converters/pre-b.py tr.csv tr.gbdt.out tr.ffm'.format(nr_thread=NR_THREAD)
# parallelizer-b.py
import argparse, sys
from common import *

def parse_args():
    if len(sys.argv) == 1:
        sys.argv.append('-h')
    parser = argparse.ArgumentParser()
    parser.add_argument('-s', dest='nr_thread', default=12, type=int)
    parser.add_argument('cvt_path')    # converters/pre-b.py
    parser.add_argument('src1_path')   # tr.csv
    parser.add_argument('src2_path')   # tr.gbdt.out
    parser.add_argument('dst_path')    # tr.ffm
    args = vars(parser.parse_args())
    return args

def main():
    args = parse_args()
    nr_thread = args['nr_thread']
    split(args['src1_path'], nr_thread, True)    # tr.csv has a header line
    split(args['src2_path'], nr_thread, False)   # tr.gbdt.out does not
    parallel_convert(args['cvt_path'], [args['src1_path'], args['src2_path'], args['dst_path']], nr_thread)
    cat(args['dst_path'], nr_thread)
    delete(args['src1_path'], nr_thread)
    delete(args['src2_path'], nr_thread)
    delete(args['dst_path'], nr_thread)

main()
# pre-b.py: turn raw rows plus GBDT leaf indices into hashed libffm format
import argparse, csv, sys
from common import *

if len(sys.argv) == 1:
    sys.argv.append('-h')

parser = argparse.ArgumentParser()
parser.add_argument('-n', '--nr_bins', type=int, default=int(1e+6))
parser.add_argument('-t', '--threshold', type=int, default=int(10))
parser.add_argument('csv_path', type=str)
parser.add_argument('gbdt_path', type=str)
parser.add_argument('out_path', type=str)
args = vars(parser.parse_args())

# Example of the final per-row features, one 'field:hashed_index:1' token per field:
# feats = ['0:40189:1', '1:498397:1', '2:131438:1', ..., '67:249369:1', '68:748254:1']
def gen_hashed_fm_feats(feats, nr_bins):
    feats = ['{0}:{1}:1'.format(field-1, hashstr(feat, nr_bins)) for (field, feat) in feats]
    return feats

frequent_feats = read_freqent_feats(args['threshold'])

with open(args['out_path'], 'w') as f:
    for row, line_gbdt in zip(csv.DictReader(open(args['csv_path'])), open(args['gbdt_path'])):
        feats = []
        # gen_feats yields 'I1-SP1', ..., 'C26-9727dd16' (see common.py)
        for feat in gen_feats(row):
            field = feat.split('-')[0]
            type, field = field[0], int(field[1:])  # type is 'I' or 'C'; field is the index 1-39
            if type == 'C' and feat not in frequent_feats:
                # rare categorical values collapse into a single per-field 'less' feature
                feat = feat.split('-')[0]+'less'
            if type == 'C':
                field += 13
            feats.append((field, feat))  # tuples of (field index, feature string)
        # the 30 GBDT leaf indices become fields 40-69
        for i, feat in enumerate(line_gbdt.strip().split()[1:], start=1):
            field = i + 39
            feats.append((field, str(i)+":"+feat))
        feats = gen_hashed_fm_feats(feats, args['nr_bins'])
        f.write(row['Label'] + ' ' + ' '.join(feats) + '\n')
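To see the hashing trick in isolation (my own snippet, reusing the hashstr definition from common.py): the md5-based index is deterministic, so the same feature string always lands in the same bucket and no feature dictionary ever has to be built or stored:

import hashlib

def hashstr(s, nr_bins):
    # same definition as common.py: md5 digest modulo (nr_bins - 1), shifted to start at 1
    return int(hashlib.md5(s.encode('utf8')).hexdigest(), 16) % (nr_bins - 1) + 1

# C9 maps to field 22 (9 + 13), printed as 21 after the field-1 shift in gen_hashed_fm_feats
print('21:{0}:1'.format(hashstr('C9-a73ee510', int(1e6))))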
The encoded result looks like the commented feats example near the top of pre-b.py: one 'field:hashed_index:1' token per field, 69 tokens per line.
3–FFM Training
Below is the official libffm documentation's explanation of the data format; with it, the preprocessing above is easy to understand.
It is important to understand the difference between `field' and `feature'. For example, if we have a raw data like this:
Click  Advertiser  Publisher
=====  ==========  =========
0      Nike        CNN
1      ESPN        BBC
Here, we have
* 2 fields: Advertiser and Publisher
* 4 features: Advertiser-Nike, Advertiser-ESPN, Publisher-CNN, Publisher-BBC
Usually you will need to build two dictionaries, one for field and one for features, like this:
DictField[Advertiser] -> 0
DictField[Publisher] -> 1
DictFeature[Advertiser-Nike] -> 0
DictFeature[Publisher-CNN] -> 1
DictFeature[Advertiser-ESPN] -> 2
DictFeature[Publisher-BBC] -> 3
Then, you can generate FFM format data:
0 0:0:1 1:1:1
1 0:2:1 1:3:1
Note that because these features are categorical, the values here are all ones.
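The dictionary-building recipe above fits in a few lines. A minimal sketch (my own code, reproducing the README's toy example):

# Build the field and feature dictionaries on the fly and emit FFM-format lines
rows = [('0', 'Nike', 'CNN'), ('1', 'ESPN', 'BBC')]
dict_field = {'Advertiser': 0, 'Publisher': 1}
dict_feature = {}
for click, adv, pub in rows:
    out = [click]
    for field, value in (('Advertiser', adv), ('Publisher', pub)):
        feat = field + '-' + value
        idx = dict_feature.setdefault(feat, len(dict_feature))  # ids assigned first-come first-served
        out.append('{0}:{1}:1'.format(dict_field[field], idx))
    print(' '.join(out))
# prints the two lines shown above:
# 0 0:0:1 1:1:1
# 1 0:2:1 1:3:1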
For the FFM training itself you can use the official libffm library. A big advantage of this library is incremental training: the data does not have to be loaded into memory all at once. An analysis of the FFM source code will be shared in the next post. The fully annotated code from this post can be downloaded from my GitHub.
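For reference, a typical invocation looks like the following; the flags (-l regularization, -k latent factors, -t iterations, -s threads, -p validation set) are taken from the libffm README, so check them against your version:

cmd = './ffm-train -l 0.00002 -k 4 -t 15 -s {nr_thread} -p va.ffm tr.ffm model'.format(nr_thread=NR_THREAD)
cmd = './ffm-predict te.ffm model te.out'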