Python Machine Learning (3): Logistic Regression Classification Example: Credit Card Fraud Detection (Part 2)

Source: Internet · Editor: 程序博客网 · Time: 2024/04/29 07:39


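The previous post used undersampling to deal with the class imbalance in the data. Judging from the test results, undersampling left the model with a very high false-positive rate (legitimate transactions wrongly flagged as fraud). This time we use oversampling and see how the results compare.
creditcard.csv dataset (click here to download)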

As before, we start by standardizing the Amount column and dropping the columns we do not need (Amount and Time).

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

data = pd.read_csv('creditcard.csv')

# Standardize Amount into a new column, then drop the raw Amount and Time columns
data['normAmount'] = StandardScaler().fit_transform(data['Amount'].values.reshape(-1, 1))
data = data.drop(['Amount', 'Time'], axis=1)
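To make the standardization step concrete, here is a minimal NumPy sketch of the z-score transform that `StandardScaler` applies with its default settings (subtract the mean, divide by the population standard deviation). The `standardize` helper and the sample amounts are hypothetical and only serve as an illustration.

```python
import numpy as np

def standardize(values):
    """Return (values - mean) / std, using the population std (ddof=0)."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()

# Hypothetical transaction amounts, not taken from creditcard.csv
amounts = np.array([1.0, 10.0, 100.0, 1000.0])
norm_amount = standardize(amounts)

# After the transform the column has mean 0 and standard deviation 1
print(norm_amount.mean(), norm_amount.std())
```

Putting Amount on the same scale as the other features keeps one large-valued column from dominating a regularized logistic regression.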

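Below are two functions, one that finds best_c and one that plots the confusion matrix. Both were introduced in the previous post (Logistic Regression Classification Example: Credit Card Fraud Detection, Part 1).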

import itertools
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.metrics import confusion_matrix, recall_score, classification_report


def printing_Kfold_scores(x_train_data, y_train_data):
    # 5-fold split without shuffling (sklearn.cross_validation has been removed;
    # KFold now lives in sklearn.model_selection and is iterated via .split())
    fold = KFold(n_splits=5, shuffle=False)

    # Candidate values for the inverse regularization strength C
    c_param_range = [0.01, 0.1, 1, 10, 100]
    results_table = pd.DataFrame({'C_parameter': c_param_range,
                                  'Mean recall score': np.nan})

    for j, c_param in enumerate(c_param_range):
        print('-------------------------------------------')
        print('C parameter: ', c_param)
        print('-------------------------------------------')
        print('')

        recall_accs = []
        for iteration, (train_idx, val_idx) in enumerate(fold.split(x_train_data), start=1):
            # The default lbfgs solver does not support the L1 penalty, so use liblinear
            lr = LogisticRegression(C=c_param, penalty='l1', solver='liblinear')
            lr.fit(x_train_data.iloc[train_idx, :],
                   y_train_data.iloc[train_idx, :].values.ravel())

            # Score recall on the held-out fold
            y_pred = lr.predict(x_train_data.iloc[val_idx, :])
            recall_acc = recall_score(y_train_data.iloc[val_idx, :].values, y_pred)
            recall_accs.append(recall_acc)
            print('Iteration ', iteration, ': recall score = ', recall_acc)

        # .ix has been removed from pandas; use .loc instead
        results_table.loc[j, 'Mean recall score'] = np.mean(recall_accs)
        print('')
        print('Mean recall score ', np.mean(recall_accs))
        print('')

    best_c = results_table.loc[results_table['Mean recall score'].idxmax(), 'C_parameter']

    print('*********************************************************************************')
    print('Best model to choose from cross validation is with C parameter = ', best_c)
    print('*********************************************************************************')
    return best_c


def plot_confusion_matrix(cm, classes,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=0)
    plt.yticks(tick_marks, classes)

    # Write each count into its cell, in white on dark cells and black on light ones
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

Next, we process the data with oversampling.

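# Separate features and label. 'Class' is no longer the last column once
# normAmount has been appended, so select by name rather than by position
# (dropping the last column would leave the label inside the features).
features = data.loc[:, data.columns != 'Class']
labels = data['Class']

features_train, features_test, labels_train, labels_test = train_test_split(
    features, labels, test_size=0.3, random_state=0)

# Oversample the training split only; the test split stays untouched.
# fit_resample replaces the older fit_sample in current imbalanced-learn releases.
oversampler = SMOTE(random_state=0)
os_features, os_labels = oversampler.fit_resample(features_train, labels_train)
os_features = pd.DataFrame(os_features)
os_labels = pd.DataFrame(os_labels)
# print((os_labels.values == 1).sum())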

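Separate the features from the labels.
Split the data into training and test sets at a ratio of 7:3.
Apply SMOTE to the training samples to obtain a balanced training set.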
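The order of these steps matters: the split comes first, and only the training part is balanced afterwards, so the test set keeps its original class ratio. The sketch below illustrates this with NumPy on hypothetical labels (1000 legitimate, 20 fraudulent). The plain index shuffle is a stand-in for `train_test_split`, and the counts are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical imbalanced labels: 1000 legitimate (0) and 20 fraudulent (1)
labels = np.array([0] * 1000 + [1] * 20)

# Shuffle the row indices and keep 70% for training, 30% for testing
indices = rng.permutation(len(labels))
n_train = len(labels) * 7 // 10
train_idx, test_idx = indices[:n_train], indices[n_train:]

train_counts = np.bincount(labels[train_idx], minlength=2)
test_counts = np.bincount(labels[test_idx], minlength=2)

# Only the training split is balanced; this is how many synthetic
# minority rows an oversampler would have to add to it
n_synthetic = train_counts[0] - train_counts[1]

print("train counts:", train_counts, "test counts:", test_counts)
print("synthetic minority samples needed:", n_synthetic)
```

Oversampling before the split would let synthetic points derived from test rows leak into training, which is why the test set is left alone.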

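Oversampling

Oversampling generates additional samples for the minority class until the classes are balanced. One of the most commonly used methods is the SMOTE algorithm.
For a detailed introduction to SMOTE, see the paper SMOTE: Synthetic Minority Over-sampling Technique.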

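Suppose we have 100 negative samples and 600 positive samples. We then oversample from the 100 negative samples to generate 500 more negative samples.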

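For each of the 100 negative samples, find its k nearest neighbors. That is, compute the Euclidean distance between one negative sample and each of the other 99 negative samples, $\sqrt{(a_1 - b_1)^2 + (a_2 - b_2)^2 + \cdots + (a_n - b_n)^2}$ (where $a$ and $b$ are the corresponding feature values of the two samples being compared), then sort the 99 samples by distance in ascending order. The first k are that sample's k nearest neighbors.
For each negative sample, randomly pick 5 of its k nearest neighbors and generate one new sample from each, so 100 samples yield the 500 new ones needed. Let $x_n$ be the n-th feature of the new sample, $a_n$ the n-th feature of the negative sample, and $b_n$ the n-th feature of one of its k nearest neighbors. The n-th feature of the new sample is then:
$x_n = a_n + \mathrm{rand}(0, 1) \times (b_n - a_n)$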
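The two steps above can be sketched in a few lines of NumPy. This is an illustrative toy, not the imbalanced-learn implementation: the function name `smote_sketch`, its parameters, and the random input data are all hypothetical, and the sketch assumes neighbors are drawn with replacement and that one random factor is shared by all features of a new sample.

```python
import numpy as np

def smote_sketch(minority, n_per_sample=5, k=5, seed=0):
    """Create n_per_sample synthetic rows per minority row by interpolating
    towards randomly chosen members of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    n_rows, n_features = minority.shape

    # Pairwise Euclidean distances between all minority samples
    diffs = minority[:, None, :] - minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)            # a sample is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_rows * n_per_sample, n_features))
    row = 0
    for i in range(n_rows):
        for _ in range(n_per_sample):
            j = neighbours[i, rng.integers(k)]  # one of the k neighbours (b)
            gap = rng.random()                  # rand(0, 1)
            # x = a + rand(0, 1) * (b - a), applied to every feature
            synthetic[row] = minority[i] + gap * (minority[j] - minority[i])
            row += 1
    return synthetic

# 100 hypothetical minority samples with 3 features each
minority = np.random.default_rng(1).normal(size=(100, 3))
new_rows = smote_sketch(minority)
print(new_rows.shape)  # (500, 3)
```

Each new row lies on the line segment between an existing minority sample and one of its neighbors, so the synthetic data stays inside the region the minority class already occupies.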

best_c = printing_Kfold_scores(os_features, os_labels)

# The default lbfgs solver does not support the L1 penalty, so use liblinear
lr = LogisticRegression(C=best_c, penalty='l1', solver='liblinear')
lr.fit(os_features, os_labels.values.ravel())
y_pred = lr.predict(features_test.values)

# Compute confusion matrix
cnf_matrix = confusion_matrix(labels_test, y_pred)
np.set_printoptions(precision=2)

print("Recall metric in the testing dataset: ",
      cnf_matrix[1, 1] / (cnf_matrix[1, 0] + cnf_matrix[1, 1]))

# Plot non-normalized confusion matrix
class_names = [0, 1]
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names, title='Confusion matrix')
plt.show()

Find best_c, train the model on the oversampled data, then evaluate it on the test set taken from the original data:

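Compared with the undersampling test results in the previous post, oversampling raises the model's recall further (with more training data, the model naturally does better). Most importantly, the number of false positives drops sharply, from 6973 before to 6 now.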
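For reference, both quantities discussed here can be read straight off the confusion matrix. The sketch below uses made-up counts, not the results of this post, and assumes the scikit-learn layout in which rows are true labels and columns are predicted labels, in the order [0, 1].

```python
import numpy as np

# Hypothetical confusion matrix: rows = true label, columns = predicted label
cnf_matrix = np.array([[990, 10],
                       [  5, 45]])

tn, fp = cnf_matrix[0]   # true negatives, false positives
fn, tp = cnf_matrix[1]   # false negatives, true positives

recall = tp / (tp + fn)                # share of real frauds that were caught
false_positive_rate = fp / (fp + tn)   # share of legitimate rows flagged as fraud

print("recall:", recall)
print("false positives:", fp, "false-positive rate:", false_positive_rate)
```

The false-positive count is the top-right cell, which is the figure the comparison above refers to.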
