Kaggle Challenge: Synthetic Financial Datasets For Fraud Detection
Source: Internet · Editor: 程序博客网 · Date: 2024/05/18 18:02
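This article works through the Kaggle dataset Synthetic Financial Datasets For Fraud Detection, a financial fraud prediction task, to get a reasonably complete picture of the data mining process. The dataset contains more than six million records, each describing one transaction recorded by the bank. Each record has 11 fields: the time step (step), the transaction type (type), the amount (amount), the names of the originating and destination accounts (nameOrig, nameDest), the originating account's balance before and after the transaction (oldbalanceOrg, newbalanceOrig), the destination account's balance before and after the transaction (oldbalanceDest, newbalanceDest), the fraud label (isFraud), and the fraud flag raised by the bank's own system model (isFlaggedFraud).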
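After cleaning, organizing, visualizing, and preprocessing the data and doing some feature engineering, we use logistic regression (LogisticRegression) for binary classification and evaluate it with the ROC curve and the AUC value, which suggest that the approach works reasonably well. Readers can follow the recorded code, run it step by step, and check the results to get a general feel for a data analysis or machine learning workflow. For the theory behind logistic regression, see the two articles "Understanding the Logistic Regression Algorithm 1" and "Understanding the Logistic Regression Algorithm 2" (逻辑回归算法理解1, 逻辑回归算法理解2).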
Project URL: https://www.kaggle.com/ntnu-testimon/paysim1
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import preprocessing
from scipy.stats import skew, boxcox
import os

# Raw string, so the backslashes in the Windows path are not read as escape sequences
dataset_path = r'D:\In\kaggle\PS_20174392719_1491204439457_log.csv'
raw_data = pd.read_csv(dataset_path)

# Inspect the dataset
print('Data preview:')
print(raw_data.head())
print('Summary statistics:')
print(raw_data.describe())
print('Dataset info:')
raw_data.info()  # info() prints its report itself and returns None

Data preview:
   step      type    amount     nameOrig  oldbalanceOrg  newbalanceOrig  \
0     1   PAYMENT   9839.64  C1231006815       170136.0       160296.36
1     1   PAYMENT   1864.28  C1666544295        21249.0        19384.72
2     1  TRANSFER    181.00  C1305486145          181.0            0.00
3     1  CASH_OUT    181.00   C840083671          181.0            0.00
4     1   PAYMENT  11668.14  C2048537720        41554.0        29885.86

      nameDest  oldbalanceDest  newbalanceDest  isFraud  isFlaggedFraud
0  M1979787155             0.0             0.0        0               0
1  M2044282225             0.0             0.0        0               0
2   C553264065             0.0             0.0        1               0
3    C38997010         21182.0             0.0        1               0
4  M1230701703             0.0             0.0        0               0

Summary statistics:
               step        amount  oldbalanceOrg  newbalanceOrig  \
count  6.362620e+06  6.362620e+06   6.362620e+06    6.362620e+06
mean   2.433972e+02  1.798619e+05   8.338831e+05    8.551137e+05
std    1.423320e+02  6.038582e+05   2.888243e+06    2.924049e+06
min    1.000000e+00  0.000000e+00   0.000000e+00    0.000000e+00
25%    1.560000e+02  1.338957e+04   0.000000e+00    0.000000e+00
50%    2.390000e+02  7.487194e+04   1.420800e+04    0.000000e+00
75%    3.350000e+02  2.087215e+05   1.073152e+05    1.442584e+05
max    7.430000e+02  9.244552e+07   5.958504e+07    4.958504e+07

       oldbalanceDest  newbalanceDest       isFraud  isFlaggedFraud
count    6.362620e+06    6.362620e+06  6.362620e+06    6.362620e+06
mean     1.100702e+06    1.224996e+06  1.290820e-03    2.514687e-06
std      3.399180e+06    3.674129e+06  3.590480e-02    1.585775e-03
min      0.000000e+00    0.000000e+00  0.000000e+00    0.000000e+00
25%      0.000000e+00    0.000000e+00  0.000000e+00    0.000000e+00
50%      1.327057e+05    2.146614e+05  0.000000e+00    0.000000e+00
75%      9.430367e+05    1.111909e+06  0.000000e+00    0.000000e+00
max      3.560159e+08    3.561793e+08  1.000000e+00    1.000000e+00

Dataset info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6362620 entries, 0 to 6362619
Data columns (total 11 columns):
step              int64
type              object
amount            float64
nameOrig          object
oldbalanceOrg     float64
newbalanceOrig    float64
nameDest          object
oldbalanceDest    float64
newbalanceDest    float64
isFraud           int64
isFlaggedFraud    int64
dtypes: float64(5), int64(3), object(3)
memory usage: 534.0+ MB

print('Transaction type counts:')
print(raw_data['type'].value_counts())  # number of records for each value of the type column

fig, ax = plt.subplots(1, 1, figsize=(8, 4))
raw_data['type'].value_counts().plot(kind='bar', title='Transaction Type', ax=ax, figsize=(8, 4))
plt.show()

Transaction type counts:
CASH_OUT    2237500
PAYMENT     2151495
CASH_IN     1399284
TRANSFER     532909
DEBIT         41432
Name: type, dtype: int64
# Record counts by transaction type and fraud label
ax = raw_data.groupby(['type', 'isFraud']).size().plot(kind='bar')  # group by (type, isFraud); size() counts the rows in each group
ax.set_title('# of transactions vs (type + isFraud)')
ax.set_xlabel('(type, isFraud)')
ax.set_ylabel('# of transaction')
# Annotate each bar with its count
for p in ax.patches:
    ax.annotate(str(format(int(p.get_height()), ',d')), (p.get_x(), p.get_height()*1.01))  # label text with thousands separators, then the xy position of the label
plt.show()
# Record counts by transaction type and the fraud flag set by the business model
ax = raw_data.groupby(['type', 'isFlaggedFraud']).size().plot(kind='bar')  # for each type, count how many records have flag 0 and flag 1
ax.set_title('# of transactions vs (type + isFlaggedFraud)')
ax.set_xlabel('(type, isFlaggedFraud)')
ax.set_ylabel('# of transaction')
# Annotate each bar with its count
for p in ax.patches:
    ax.annotate(str(format(int(p.get_height()), ',d')), (p.get_x(), p.get_height()*1.01))
plt.show()
Next we explore the data visually. seaborn really is a powerful tool for this.
fig, axs = plt.subplots(2, 2, figsize=(10, 10))
transfer_data = raw_data[raw_data['type'] == 'TRANSFER']  # TRANSFER is the type we care about most, so we pull it out to plot and analyze separately

# Transfer amount vs. the system's fraud flag: flagged transfers tend to have larger amounts
a = sns.boxplot(x='isFlaggedFraud', y='amount', data=transfer_data, ax=axs[0][0])  # box plot: lower and upper quartiles and the median
axs[0][0].set_yscale('log')

# Destination account's original balance vs. the system's fraud flag: flagged transfers tend to have smaller original balances
b = sns.boxplot(x='isFlaggedFraud', y='oldbalanceDest', data=transfer_data, ax=axs[0][1])
axs[0][1].set(ylim=(0, 0.5e8))  # ylim restricts the y-axis range

# Originating account's original balance vs. the system's fraud flag
c = sns.boxplot(x='isFlaggedFraud', y='oldbalanceOrg', data=transfer_data, ax=axs[1][0])
axs[1][0].set(ylim=(0, 3e7))  # the box plots broadly agree with intuition

# Linear relationship? The larger the original balance, the larger the amount transferred out?
d = sns.regplot(x='oldbalanceOrg', y='amount', data=transfer_data[transfer_data['isFlaggedFraud'] == 1], ax=axs[1][1])
plt.show()
used_data = raw_data[(raw_data['type'] == 'TRANSFER') | (raw_data['type'] == 'CASH_OUT')]  # keep only the TRANSFER and CASH_OUT rows
used_data = used_data.drop(['step', 'nameOrig', 'nameDest', 'isFlaggedFraud'], axis=1)  # drop the columns we do not use; a non-inplace drop avoids SettingWithCopyWarning on the filtered slice
# Reset the index
used_data = used_data.reset_index(drop=True)
# Convert type to a category code, i.e. 0 or 1
type_label_encoder = preprocessing.LabelEncoder()  # data preprocessing
type_category = type_label_encoder.fit_transform(used_data['type'].values)
used_data['typeCategory'] = type_category
used_data.head()

       type     amount  oldbalanceOrg  newbalanceOrig  oldbalanceDest  \
0  TRANSFER     181.00          181.0             0.0             0.0
1  CASH_OUT     181.00          181.0             0.0         21182.0
2  CASH_OUT  229133.94        15325.0             0.0          5083.0
3  TRANSFER  215310.30          705.0             0.0         22425.0
4  TRANSFER  311685.89        10835.0             0.0          6267.0

   newbalanceDest  isFraud  typeCategory
0            0.00        1             1
1            0.00        1             0
2        51513.44        0             0
3            0.00        0             1
4      2719172.89        0             1

In [47]: sns.heatmap(used_data.drop(columns='type').corr())  # correlation between feature columns; the text column type is dropped because corr() needs numeric data
Out[47]: <matplotlib.axes._subplots.AxesSubplot at 0x22407f999b0>

In [48]: plt.show()
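To make the type encoding above concrete: LabelEncoder assigns integer codes to the class labels in sorted order, which is why CASH_OUT becomes 0 and TRANSFER becomes 1 in the preview. The pure-Python sketch below mimics that behaviour on the five preview rows. The helper name encode_labels is hypothetical and this is only an illustration, not scikit-learn's implementation.

```python
def encode_labels(values):
    """Map each distinct label to an integer code, in sorted label order."""
    classes = sorted(set(values))
    code_of = {label: code for code, label in enumerate(classes)}
    return [code_of[v] for v in values], classes


# The type column of the five preview rows shown above
types = ['TRANSFER', 'CASH_OUT', 'CASH_OUT', 'TRANSFER', 'TRANSFER']
codes, classes = encode_labels(types)
print(classes)
print(codes)
```

With only two classes the integer code is harmless; with more classes an integer code would impose an artificial ordering on the categories.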
ax = used_data['type'].value_counts().plot(kind='bar', title="Transaction Type", figsize=(6, 6))  # number of records of each type
for p in ax.patches:
    ax.annotate(str(format(int(p.get_height()), ',d')), (p.get_x(), p.get_height()*1.01))  # the second argument is the xy position of the label
plt.show()
ax = pd.value_counts(used_data['isFraud'], sort=True).sort_index().plot(kind='bar', title="Fraud Transaction Count")  # number of records in each class in the current data
for p in ax.patches:
    ax.annotate(str(format(int(p.get_height()), ',d')), (p.get_x(), p.get_height()))
plt.show()

The plot shows that fraud and non-fraud records are severely imbalanced.
In [61]: xx=pd.value_counts(used_data['isFraud'],sort=True)

In [62]: type(xx)
Out[62]: pandas.core.series.Series

In [63]: xx.head()
Out[63]:
0    2762196
1       8213
Name: isFraud, dtype: int64

In [64]: xx
Out[64]:
0    2762196
1       8213
Name: isFraud, dtype: int64

In [65]: xx=pd.value_counts(used_data['isFraud'],sort=True).sort_index()  # adding sort_index() seems to change nothing? value_counts sorts by count in descending order, which here already puts 0 before 1

In [66]: xx
Out[66]:
0    2762196
1       8213
Name: isFraud, dtype: int64

In [67]: pd.value_counts(used_data['isFraud'])
Out[67]:
0    2762196
1       8213
Name: isFraud, dtype: int64

The positive (fraud) samples are far fewer than the negative ones: 8,213 against 2,762,196. A model trained on data this imbalanced can reach high accuracy on the negative samples while doing poorly on the positive ones, so we undersample: we reduce the negative samples to roughly the number of positive samples.

In [68]: feature_names=['amount','oldbalanceOrg','newbalanceOrig','oldbalanceDest','newbalanceDest','typeCategory']
    ...: X=used_data[feature_names]
    ...: Y=used_data['isFraud']
    ...: X.head()
    ...: Y.head()
    ...:
Out[68]:
0    1
1    1
2    0
3    0
4    0
Name: isFraud, dtype: int64

In [69]: X.head()
Out[69]:
      amount  oldbalanceOrg  newbalanceOrig  oldbalanceDest  newbalanceDest  \
0     181.00          181.0             0.0             0.0            0.00
1     181.00          181.0             0.0         21182.0            0.00
2  229133.94        15325.0             0.0          5083.0        51513.44
3  215310.30          705.0             0.0         22425.0            0.00
4  311685.89        10835.0             0.0          6267.0      2719172.89

   typeCategory
0             1
1             0
2             0
3             1
4             1

In [70]: number_records_fraud=len(used_data[used_data['isFraud']==1])

In [71]: number_records_fraud  # 8213 fraud records
Out[71]: 8213

In [72]: xx=used_data['isFraud']==1

In [73]: type(xx)
Out[73]: pandas.core.series.Series

In [74]: xx
Out[74]:
0           True
1           True
2          False
3          False
4          False
5          False
6          False
7          False
8          False
9          False
10         False
11         False
12         False
13         False
14         False
15         False
16         False
17         False
18         False
19         False
20         False
21         False
22         False
23         False
24         False
25         False
26         False
27         False
28         False
29         False
           ...
2770379     True
2770380     True
2770381     True
2770382     True
2770383     True
2770384     True
2770385     True
2770386     True
2770387     True
2770388     True
2770389     True
2770390     True
2770391     True
2770392     True
2770393     True
2770394     True
2770395     True
2770396     True
2770397     True
2770398     True
2770399     True
2770400     True
2770401     True
2770402     True
2770403     True
2770404     True
2770405     True
2770406     True
2770407     True
2770408     True
Name: isFraud, Length: 2770409, dtype: bool

In [75]: fraud_indices=used_data[used_data['isFraud']==1].index.values  # indices of the positive samples

In [76]: len(fraud_indices)
Out[76]: 8213

In [77]: fraud_indices
Out[77]: array([      0,       1,     123, ..., 2770406, 2770407, 2770408], dtype=int64)
# The rows at these indices are the positive samples

In [78]: fraud_indices[:5]
Out[78]: array([  0,   1, 123, 124, 192], dtype=int64)

In [79]: nonfraud_indices=used_data[used_data['isFraud']==0].index

In [80]: nonfraud_indices  # indices of the negative samples
Out[80]:
Int64Index([      2,       3,       4,       5,       6,       7,       8,
                  9,      10,      11,
            ...
            2770103, 2770104, 2770105, 2770106, 2770107, 2770108, 2770109,
            2770110, 2770111, 2770112],
           dtype='int64', length=2762196)

In [81]: random_nonfraud_indices=np.random.choice(nonfraud_indices,number_records_fraud,replace=False)  # randomly pick 8213 of the negative indices, without replacement, as the new negative sample

In [82]: random_nonfraud_indices=np.array(random_nonfraud_indices)  # the 8213 new negative indices

In [83]: under_sample_indices=np.concatenate([fraud_indices,random_nonfraud_indices])  # indices of the new undersampled data
    ...: under_sample_data=used_data.iloc[under_sample_indices,:]
    ...:
    ...: X_undersample = under_sample_data[feature_names].values
    ...: y_undersample = under_sample_data['isFraud'].values
    ...:
    ...: # Show the class proportions
    ...: print("Non-fraud proportion: ", len(under_sample_data[under_sample_data['isFraud'] == 0]) / len(under_sample_data))
    ...: print("Fraud proportion: ", len(under_sample_data[under_sample_data['isFraud'] == 1]) / len(under_sample_data))
    ...: print("Undersampled record count: ", len(under_sample_data))
    ...:
Non-fraud proportion:  0.5
Fraud proportion:  0.5
Undersampled record count:  16426

In [85]: from sklearn.model_selection import train_test_split
    ...: from sklearn.linear_model import LogisticRegression
    ...: from sklearn.metrics import roc_curve, auc
    ...:
    ...: X_train, X_test, y_train, y_test = train_test_split(X_undersample, y_undersample, test_size=0.3, random_state=0)  # 70/30 split
    ...: lr_model = LogisticRegression()
    ...: lr_model.fit(X_train, y_train)
    ...: y_pred_score = lr_model.predict_proba(X_test)
    ...:

In [86]: y_pred_score
Out[86]:
array([[ 0.50192359,  0.49807641],
       [ 0.95716076,  0.04283924],
       [ 0.45924015,  0.54075985],
       ...,
       [ 0.98630318,  0.01369682],
       [ 0.25148841,  0.74851159],
       [ 0.50527488,  0.49472512]])

In [87]: fpr, tpr, thresholds = roc_curve(y_test, y_pred_score[:, 1])  # column 1 is the probability of the fraud class; note the thresholds
    ...: roc_auc = auc(fpr,tpr)
    ...: plt.title('Receiver Operating Characteristic')
    ...: plt.plot(fpr, tpr, 'b',label='AUC = %0.2f'% roc_auc)
    ...: plt.legend(loc='lower right')
    ...: plt.plot([0,1],[0,1],'r--')
    ...: plt.xlim([-0.1,1.0])
    ...: plt.ylim([-0.1,1.01])
    ...: plt.ylabel('True Positive Rate')
    ...: plt.xlabel('False Positive Rate')
    ...: plt.show()
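The undersampling step used before training can also be looked at in isolation. The sketch below is a minimal pure-Python version on toy labels: it keeps every positive sample and draws an equally sized random subset of the negatives without replacement. The helper name undersample, the seed, and the toy data are assumptions made for illustration; the article itself does this with np.random.choice on the DataFrame index.

```python
import random


def undersample(labels, seed=0):
    """Return indices of all positives plus an equal-sized random subset of negatives."""
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    rng = random.Random(seed)              # fixed seed so the sketch is repeatable
    kept_neg = rng.sample(neg, len(pos))   # sampling without replacement
    return pos + kept_neg


labels = [1] * 5 + [0] * 995               # toy data: 0.5% positives
idx = undersample(labels)
balanced = [labels[i] for i in idx]
print(len(balanced), sum(balanced) / len(balanced))
```

One thing to keep in mind with this design: a test set split off from the balanced sample has a 50/50 class ratio, unlike the original data, so metrics measured on it need not carry over unchanged to the full dataset.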
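AUC and the ROC curve: is higher accuracy always better? Not necessarily. Suppose 100 samples contain 99 negatives and 1 positive. A model that predicts every sample as negative gets all 99 negatives right, so its accuracy is 99%, yet it identifies none of the positives, so its accuracy on the positive class is 0. Accuracy alone is therefore not enough, which is why we bring in the ROC (receiver operating characteristic) curve and AUC (area under the curve). AUC is the area under the ROC curve and is commonly used as an evaluation metric for binary classification.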
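The 99-to-1 example above can be checked with a few lines of Python. This is only a toy illustration; the "model" here is a stand-in that always predicts the negative class.

```python
y_true = [0] * 99 + [1]   # 99 negatives, 1 positive
y_pred = [0] * 100        # a "model" that always predicts negative

# Overall accuracy: fraction of predictions that match the labels
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Accuracy on the positive class only (recall): predictions for the true positives
pred_for_positives = [p for t, p in zip(y_true, y_pred) if t == 1]
recall = sum(pred_for_positives) / len(pred_for_positives)

print(accuracy)
print(recall)
```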
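TP: true positive, the actual label is 1 and the prediction is 1
FP: false positive, the actual label is 0 and the prediction is 1
TN: true negative, the actual label is 0 and the prediction is 0
FN: false negative, the actual label is 1 and the prediction is 0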
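TPR (true positive rate) is the fraction of all positive samples, i.e. samples whose actual label is 1, that end up predicted as 1: TPR = TP / (TP + FN).
FPR (false positive rate) is the fraction of all negative samples, i.e. samples whose actual label is 0, that end up predicted as 1: FPR = FP / (FP + TN).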
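As a small worked example of these definitions, the sketch below counts TP, FP, TN, and FN for a toy set of labels and predictions and derives TPR = TP / (TP + FN) and FPR = FP / (FP + TN) from them. The helper name confusion_counts and the toy data are made up for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for binary labels (1 = positive, 0 = negative)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
tpr = tp / (tp + fn)   # share of actual positives predicted as 1
fpr = fp / (fp + tn)   # share of actual negatives predicted as 1
print(tp, fp, tn, fn)
print(tpr, fpr)
```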
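The closer the ROC curve is to the top-left corner, the more positive samples are predicted as 1 and the fewer negative samples are predicted as 1 (that is, the more negatives are predicted as 0), and so the better the model's predictions.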
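Each point on the ROC curve corresponds to one threshold, that is, to one classifier, and each threshold yields one pair of TPR and FPR. At the largest threshold nothing is predicted as positive, so TP = FP = 0 and the point is the origin (0, 0). At the smallest threshold everything is predicted as positive, so TN = FN = 0, TPR = FPR = 1, and the point is the top-right corner (1, 1). As the threshold theta increases, TP and FP both decrease, so TPR and FPR decrease as well and the ROC point moves toward the lower left.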
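The threshold sweep described above can be sketched in pure Python. The code below computes one (FPR, TPR) point per threshold, from the highest threshold down to the lowest, and then approximates the area under the resulting curve with the trapezoidal rule. The function names roc_points and auc_trapezoid and the toy scores are assumptions for illustration; this is a simplified sketch, not the implementation behind roc_curve and auc in scikit-learn.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs, ordered from the highest threshold to the lowest."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    # Start above every score (nothing predicted positive),
    # then use each distinct score as a threshold, in decreasing order.
    thresholds = [float('inf')] + sorted(set(scores), reverse=True)
    points = []
    for th in thresholds:
        tp = sum(1 for t, s in zip(y_true, scores) if s >= th and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= th and t == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc_trapezoid(points):
    """Area under a piecewise-linear curve given as (x, y) points sorted by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]   # toy predicted probabilities of class 1

points = roc_points(y_true, scores)
roc_auc = auc_trapezoid(points)
print(points[0], points[-1])   # endpoints of the curve
print(round(roc_auc, 4))
```

On this toy data the curve starts at the origin and ends at (1, 1), and the area equals the fraction of (positive, negative) pairs in which the positive sample receives the higher score, here 8 out of 9.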