Credit Card Fraud Case Study

# Credit Card Fraud Prediction
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
data = pd.read_csv('creditcard.csv')
data.head()
# type(data)
pandas.core.frame.DataFrame
data_view = pd.value_counts(data['Class'], sort=True).sort_index()
data_view.plot(kind='bar')
plt.title("Fraud record")
plt.xlabel("Class")
plt.ylabel("Frequency")
from sklearn.preprocessing import StandardScaler

# reshape(-1, 1) turns the Amount column into a single-column matrix:
# 1 fixes the number of columns, -1 lets numpy infer the number of rows.
# Use .values.reshape(...); calling reshape directly on a Series is deprecated.
data['nomAmount'] = StandardScaler().fit_transform(data['Amount'].values.reshape(-1, 1))
data = data.drop(['Time', 'Amount'], axis=1)
data.head()
   V1        V2        V3        V4        V5        V6        V7        V8        V9        V10      ...  V21       V22       V23       V24       V25       V26       V27       V28      Class  nomAmount
0  -1.359807 -0.072781  2.536347  1.378155 -0.338321  0.462388  0.239599  0.098698  0.363787  0.090794 ... -0.018307  0.277838 -0.110474  0.066928  0.128539 -0.189115  0.133558 -0.021053  0      0.244964
1   1.191857  0.266151  0.166480  0.448154  0.060018 -0.082361 -0.078803  0.085102 -0.255425 -0.166974 ... -0.225775 -0.638672  0.101288 -0.339846  0.167170  0.125895 -0.008983  0.014724  0     -0.342475
2  -1.358354 -1.340163  1.773209  0.379780 -0.503198  1.800499  0.791461  0.247676 -1.514654  0.207643 ...  0.247998  0.771679  0.909412 -0.689281 -0.327642 -0.139097 -0.055353 -0.059752  0      1.160686
3  -0.966272 -0.185226  1.792993 -0.863291 -0.010309  1.247203  0.237609  0.377436 -1.387024 -0.054952 ... -0.108300  0.005274 -0.190321 -1.175575  0.647376 -0.221929  0.062723  0.061458  0      0.140534
4  -1.158233  0.877737  1.548718  0.403034 -0.407193  0.095921  0.592941 -0.270533  0.817739  0.753074 ... -0.009431  0.798278 -0.137458  0.141267 -0.206010  0.502292  0.219422  0.215153  0     -0.073403

5 rows × 30 columns

---------------------------------------- Undersampling the data ----------------------------------------
numpy: a matrix-oriented computation library; it mainly handles matrix operations and numerical computation, and carries its data as ndarrays.
pandas: its core structure is the DataFrame, which is more like a table, roughly an Excel sheet loaded into memory, so the analogy makes it
easy to understand.
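A minimal illustration of the difference (the toy data and column names are made up for this example):

```python
import numpy as np
import pandas as pd

arr = np.array([[1.0, 2.0], [3.0, 4.0]])      # ndarray: positional, homogeneous
print(arr.mean(axis=0))                       # column means -> [2. 3.]

df = pd.DataFrame(arr, columns=['V1', 'V2'])  # DataFrame: same values, labeled columns
print(df['V1'].mean())                        # access a column by name -> 2.0
```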

# Separate the features (everything except 'Class') from the labels.
x = data.loc[:, data.columns != 'Class']
y = data.loc[:, data.columns == 'Class']

# Count the minority (fraud) class and collect its row indices.
numbers_record_fraud = len(data[data['Class'] == 1])
record_fraud_indices = np.array(data[data['Class'] == 1].index)
# print record_fraud_indices

# Collect the indices of the non-fraud records, so we can sample from them.
numbers_record_unfraud = len(data[data['Class'] == 0])
record_unfraud_indices = np.array(data[data['Class'] == 0].index)

# Randomly pick as many non-fraud indices as there are fraud records.
random_unfraud_indices = np.array(np.random.choice(record_unfraud_indices, numbers_record_fraud, replace=False))
# print len(random_unfraud_indices)

# Merge the two index sets into one balanced sample.
under_sample_indices = np.concatenate([random_unfraud_indices, record_fraud_indices])

# The indices refer to rows of the original frame; iloc selects by position.
under_sample_dataset = data.iloc[under_sample_indices, :]
x_undersample = under_sample_dataset.loc[:, under_sample_dataset.columns != 'Class']
y_undersample = under_sample_dataset.loc[:, under_sample_dataset.columns == 'Class']
# print len(under_sample_dataset)

Cross-validation.
Split the data:
80% for training, 20% for testing (the code below actually uses a 70/30 split).
Then randomly split the training set into three equal parts (1, 2, 3)
and cross-validate by rotation:
train on 1,2 / validate on 3
train on 1,3 / validate on 2
train on 2,3 / validate on 1
A small sketch of this rotation is shown below.
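Here is that rotation with sklearn's KFold (using the modern `sklearn.model_selection` module; the toy data is made up):

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(12).reshape(6, 2)  # six toy samples
kf = KFold(n_splits=3, shuffle=False)
for train_idx, val_idx in kf.split(X):
    # each round trains on two thirds of the rows and validates on the rest
    print("train:", train_idx, "validate:", val_idx)
```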

# sklearn ships a ready-made splitter, so we use it directly. train_test_split
# first shuffles the data and then splits it at random.
# (model_selection replaces the deprecated cross_validation module.)
from sklearn.model_selection import train_test_split

# Split the full dataset: 70% training, 30% testing.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)

# Split the undersampled dataset the same way.
x_train_undersample, x_test_undersample, y_train_undersample, y_test_undersample = \
    train_test_split(x_undersample, y_undersample, test_size=0.3, random_state=0)

Model evaluation metric:
To evaluate the trained model we use recall. We know in advance how many positives should be found; recall compares the number the model actually identifies against that total, i.e. recall = TP / (TP + FN), and its value measures how good the model is. A worked example follows the definitions below.
FN (false negative): a positive sample wrongly judged as negative
TN (true negative): a negative sample correctly judged as negative
FP (false positive): a negative sample wrongly judged as positive
TP (true positive): a positive sample correctly judged as positive
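A tiny worked example (counts made up): with TP = 90 and FN = 10, recall = 90 / (90 + 10) = 0.9. The same number via sklearn's recall_score:

```python
from sklearn.metrics import recall_score

y_true = [1] * 100 + [0] * 50             # 100 actual positives, 50 negatives
y_pred = [1] * 90 + [0] * 10 + [0] * 50   # 90 caught (TP), 10 missed (FN)
print(recall_score(y_true, y_pred))       # -> 0.9
```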
Model selection methods:
1. Regularization: covered in Andrew Ng's lecture notes and in the textbooks, so not repeated here.
2. Cross-validation (a sketch combining the two follows below).
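As a hedged sketch of how the two combine, each candidate regularization strength C can be scored by cross-validated recall and the best one kept. The helper name `pick_best_c` is mine, not from this notebook, and `solver='liblinear'` is specified because newer sklearn versions require it for the l1 penalty:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def pick_best_c(X, y, c_grid=(0.01, 0.1, 1, 10, 100)):
    # mean 5-fold recall for every candidate C (larger C = weaker regularization)
    means = [cross_val_score(LogisticRegression(C=c, penalty='l1', solver='liblinear'),
                             X, y, cv=5, scoring='recall').mean()
             for c in c_grid]
    return c_grid[int(np.argmax(means))]
```

The hand-rolled `printing_Kfold_scores` below does the same thing while also printing every fold's score.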

# Recall = TP / (TP + FN)
from sklearn.linear_model import LogisticRegression
# KFold sets up the cross-validation folds; cross_val_score returns per-fold scores.
# (model_selection replaces the deprecated cross_validation module.)
from sklearn.model_selection import KFold, cross_val_score
# The confusion matrix shows how TP, FP, FN and TN are distributed for one prediction.
from sklearn.metrics import confusion_matrix, recall_score, classification_report
# Search for the best regularization parameter C by five-fold cross-validation.
def printing_Kfold_scores(x_train_data, y_train_data):
    # Set up the cross-validation folds.
    fold = KFold(n_splits=5, shuffle=False)

    # Candidate regularization parameters.
    c_param_range = [0.01, 0.1, 1, 10, 100]

    # Results table: one row per candidate C, so the best one can be looked up at the end.
    result_table = pd.DataFrame(index=range(len(c_param_range)),
                                columns=['C_parameter', 'Mean recall score'])
    result_table['C_parameter'] = c_param_range

    j = 0
    for c_param in c_param_range:
        print('-------------------------------------------')
        print('C parameter: ', c_param)
        print('-------------------------------------------')
        print('')

        recall_accs = []
        for iteration, (train_idx, val_idx) in enumerate(fold.split(x_train_data), start=1):
            # Logistic regression with an l1 (lasso-style) penalty and the current C.
            lr = LogisticRegression(C=c_param, penalty='l1')

            # Train on this fold's training indices ...
            lr.fit(x_train_data.iloc[train_idx, :], y_train_data.iloc[train_idx].values.ravel())

            # ... and predict on the held-out validation indices.
            y_pred_undersample = lr.predict(x_train_data.iloc[val_idx, :].values)

            # Recall for this fold, plus the running mean over the folds so far.
            recall_acc = recall_score(y_train_data.iloc[val_idx, :].values, y_pred_undersample)
            recall_accs.append(recall_acc)
            print('Iteration ', iteration, ': recall score = ', recall_acc)
            print('Mean recall score ', np.mean(recall_accs))

        # Record the mean recall for this C (once per C, not once per fold).
        result_table.loc[j, 'Mean recall score'] = np.mean(recall_accs)
        j += 1
        print('')

    # The score column is object-typed; cast to float so idxmax works on it.
    best_c = result_table.loc[result_table['Mean recall score'].astype('float64').idxmax()]['C_parameter']

    # Finally, we can check which C parameter is the best amongst the chosen.
    print('*********************************************************************************')
    print('Best model to choose from cross validation is with C parameter = ', best_c)
    print('*********************************************************************************')
    return best_c
best_c = printing_Kfold_scores(x_train_undersample,y_train_undersample)
-------------------------------------------
C parameter:  0.01
-------------------------------------------

Iteration  1 : recall score =  0.98461538461538467
Mean recall score  0.98461538461538467
Iteration  2 : recall score =  0.9538461538461539
Mean recall score  0.96923076923076934
Iteration  3 : recall score =  0.88607594936708856
Mean recall score  0.94151249594287567
Iteration  4 : recall score =  0.98412698412698407
Mean recall score  0.95216611798890283
Iteration  5 : recall score =  0.95774647887323938
Mean recall score  0.9532821901657702

-------------------------------------------
C parameter:  0.1
-------------------------------------------

Iteration  1 : recall score =  0.92307692307692313
Mean recall score  0.92307692307692313
Iteration  2 : recall score =  0.87692307692307692
Mean recall score  0.90000000000000002
Iteration  3 : recall score =  0.83544303797468356
Mean recall score  0.87848101265822776
Iteration  4 : recall score =  0.93650793650793651
Mean recall score  0.89298774362065503
Iteration  5 : recall score =  0.91549295774647887
Mean recall score  0.89748878644581986

-------------------------------------------
C parameter:  1
-------------------------------------------

Iteration  1 : recall score =  0.93846153846153846
Mean recall score  0.93846153846153846
Iteration  2 : recall score =  0.87692307692307692
Mean recall score  0.90769230769230769
Iteration  3 : recall score =  0.83544303797468356
Mean recall score  0.88360921778643287
Iteration  4 : recall score =  0.93650793650793651
Mean recall score  0.89683389746680886
Iteration  5 : recall score =  0.91549295774647887
Mean recall score  0.90056570952274284

-------------------------------------------
C parameter:  10
-------------------------------------------

Iteration  1 : recall score =  0.9538461538461539
Mean recall score  0.9538461538461539
Iteration  2 : recall score =  0.86153846153846159
Mean recall score  0.9076923076923078
Iteration  3 : recall score =  0.83544303797468356
Mean recall score  0.88360921778643309
Iteration  4 : recall score =  0.93650793650793651
Mean recall score  0.89683389746680886
Iteration  5 : recall score =  0.91549295774647887
Mean recall score  0.90056570952274284

-------------------------------------------
C parameter:  100
-------------------------------------------

Iteration  1 : recall score =  0.9538461538461539
Mean recall score  0.9538461538461539
Iteration  2 : recall score =  0.86153846153846159
Mean recall score  0.9076923076923078
Iteration  3 : recall score =  0.82278481012658233
Mean recall score  0.87938980850373261
Iteration  4 : recall score =  0.93650793650793651
Mean recall score  0.89366934050478353
Iteration  5 : recall score =  0.91549295774647887
Mean recall score  0.89803406395312257

*********************************************************************************
Best model to choose from cross validation is with C parameter =  0.01
*********************************************************************************

Plot the confusion matrix, and use it to see how each sampling strategy affects the results.
(The plotting routine below builds the figure cell by cell; its comments walk through how the matrix is drawn.)

def plot_confusion_matrix(cm, classes,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    # Draw the matrix itself as a colored image.
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()

    # Label both axes with the class names.
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=0)
    plt.yticks(tick_marks, classes)

    # Write each cell's count in its center: white text on dark cells, black on light.
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

From the confusion matrix drawn for the undersampled data, FP and FN account for only a small share, and the effect on recall is plain to see: Recall = TP / (TP + FN).

import itertools

lr = LogisticRegression(C=best_c, penalty='l1')
lr.fit(x_train_undersample, y_train_undersample.values.ravel())
y_pred_undersample = lr.predict(x_test_undersample.values)

# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test_undersample, y_pred_undersample)
np.set_printoptions(precision=2)

# Cast to float: the matrix holds ints, and under Python 2 int/int
# floors to 0, which is why the original run printed 0 below.
print("Recall metric in the testing dataset: ",
      float(cnf_matrix[1, 1]) / (cnf_matrix[1, 0] + cnf_matrix[1, 1]))

# Plot non-normalized confusion matrix
class_names = [0, 1]
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names, title='Confusion matrix')
plt.show()
Recall metric in the testing dataset:  0

[figure: confusion matrix for the undersampled test set]

What happens if we train the same way but evaluate on the full test set? Again we show the result with a confusion matrix.
Looking at that confusion matrix we find that over 7,000 records land in the FP cell, which shows that plain undersampling is not all that good.

# Use the model trained on the undersampled data to predict on the full test set.
lr = LogisticRegression(C=best_c, penalty='l1')
lr.fit(x_train_undersample, y_train_undersample.values.ravel())
y_pred = lr.predict(x_test.values)

# Compute the confusion matrix (float cast avoids Python 2 integer division).
cnf_matrix = confusion_matrix(y_test, y_pred)
np.set_printoptions(precision=2)
print("Recall metric in the testing dataset: ",
      float(cnf_matrix[1, 1]) / (cnf_matrix[1, 0] + cnf_matrix[1, 1]))

# Plot non-normalized confusion matrix
class_names = [0, 1]
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names, title='Confusion matrix')
plt.show()
Recall metric in the testing dataset:  0

[figure: confusion matrix for the full test set]

Printing the recall values for a model trained directly on the full (imbalanced) data shows something even worse.
So how do we proceed from here?

best_c = printing_Kfold_scores(x_train,y_train)
-------------------------------------------
C parameter:  0.01
-------------------------------------------

Iteration  1 : recall score =  0.4925373134328358
Mean recall score  0.4925373134328358
Iteration  2 : recall score =  0.60273972602739723
Mean recall score  0.54763851973011657
Iteration  3 : recall score =  0.68333333333333335
Mean recall score  0.59287012426452212
Iteration  4 : recall score =  0.56923076923076921
Mean recall score  0.58696028550608392
Iteration  5 : recall score =  0.45000000000000001
Mean recall score  0.5595682284048672

-------------------------------------------
C parameter:  0.1
-------------------------------------------

Iteration  1 : recall score =  0.56716417910447758
Mean recall score  0.56716417910447758
Iteration  2 : recall score =  0.61643835616438358
Mean recall score  0.59180126763443064
Iteration  3 : recall score =  0.68333333333333335
Mean recall score  0.62231195620073154
Iteration  4 : recall score =  0.58461538461538465
Mean recall score  0.61288781330439479
Iteration  5 : recall score =  0.52500000000000002
Mean recall score  0.59531025064351584

-------------------------------------------
C parameter:  1
-------------------------------------------

Iteration  1 : recall score =  0.55223880597014929
Mean recall score  0.55223880597014929
Iteration  2 : recall score =  0.61643835616438358
Mean recall score  0.58433858106726644
Iteration  3 : recall score =  0.71666666666666667
Mean recall score  0.62844794293373318
Iteration  4 : recall score =  0.61538461538461542
Mean recall score  0.62518211104645371
Iteration  5 : recall score =  0.5625
Mean recall score  0.61264568883716297

-------------------------------------------
C parameter:  10
-------------------------------------------

Iteration  1 : recall score =  0.55223880597014929
Mean recall score  0.55223880597014929
Iteration  2 : recall score =  0.61643835616438358
Mean recall score  0.58433858106726644
Iteration  3 : recall score =  0.73333333333333328
Mean recall score  0.63400349848928872
Iteration  4 : recall score =  0.61538461538461542
Mean recall score  0.62934877771312037
Iteration  5 : recall score =  0.57499999999999996
Mean recall score  0.61847902217049633

-------------------------------------------
C parameter:  100
-------------------------------------------

Iteration  1 : recall score =  0.55223880597014929
Mean recall score  0.55223880597014929
Iteration  2 : recall score =  0.61643835616438358
Mean recall score  0.58433858106726644
Iteration  3 : recall score =  0.73333333333333328
Mean recall score  0.63400349848928872
Iteration  4 : recall score =  0.61538461538461542
Mean recall score  0.62934877771312037
Iteration  5 : recall score =  0.57499999999999996
Mean recall score  0.61847902217049633

*********************************************************************************
Best model to choose from cross validation is with C parameter =  nan
*********************************************************************************

Every prediction above uses LogisticRegression's predict method, which returns hard class labels using the sigmoid's default threshold of 0.5.
Often we want to set that threshold ourselves: predict_proba() returns probabilities instead.
X_test = [[2, 3, 4, 5], [3, 4, 5, 6]]

Suppose the possible classes are 0 and 1:

model.predict_proba(X_test) =
array([[0.1, 0.9],   # [2,3,4,5]: probability 0.1 of class 0, 0.9 of class 1
       [0.8, 0.2]])  # [3,4,5,6]: probability 0.8 of class 0, 0.2 of class 1

lr = LogisticRegression(C=0.01, penalty='l1')
lr.fit(x_train_undersample, y_train_undersample.values.ravel())
y_pred_undersample_proba = lr.predict_proba(x_test_undersample.values)

thresholds = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

plt.figure(figsize=(10, 10))
j = 1
for i in thresholds:
    # Label a record as fraud whenever its class-1 probability exceeds the threshold.
    y_test_predictions_high_recall = y_pred_undersample_proba[:, 1] > i

    plt.subplot(3, 3, j)
    j += 1

    # Compute confusion matrix
    cnf_matrix = confusion_matrix(y_test_undersample, y_test_predictions_high_recall)
    np.set_printoptions(precision=2)
    print("Recall metric in the testing dataset: ",
          float(cnf_matrix[1, 1]) / (cnf_matrix[1, 0] + cnf_matrix[1, 1]))

    # Plot non-normalized confusion matrix
    class_names = [0, 1]
    plot_confusion_matrix(cnf_matrix, classes=class_names, title='Threshold >= %s' % i)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-11-2b493d4ae9b9> in <module>()
----> 1 lr = LogisticRegression(C = 0.01, penalty = 'l1')
      2 lr.fit(x_train_undersample,y_train_undersample.values.ravel())
      3 y_pred_undersample_proba = lr.predict_proba(x_test_undersample.values)
      4
      5 thresholds = [0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]

NameError: name 'LogisticRegression' is not defined

By reading the precision (correct predictions divided by the total) and the recall (TP / (TP + FN)) off each of these plots together, we can settle on a reasonable model.
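As a small sketch of reading both numbers off a confusion matrix (`cm[i, j]` counts samples of true class i predicted as class j; the counts here are made up):

```python
import numpy as np

cm = np.array([[85, 15],    # true class 0: 85 TN, 15 FP
               [ 5, 95]])   # true class 1:  5 FN, 95 TP

tn, fp, fn, tp = cm[0, 0], cm[0, 1], cm[1, 0], cm[1, 1]
precision = tp / float(tp + fp)   # 95 / 110 ~= 0.864
recall    = tp / float(tp + fn)   # 95 / 100  = 0.95
print(precision, recall)
```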

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-8-e1aa2a7e4955> in <module>()
      3 from sklearn.metrics import confusion_matrix
      4 from sklearn.model_selection import train_test_split
----> 5 from imblearn.over_sampling import SMOTE

/home/yanghua/anaconda2/lib/python2.7/site-packages/imblearn/__init__.py in <module>()
     29 from .version import _check_module_dependencies, __version__
     30
---> 31 _check_module_dependencies()

/home/yanghua/anaconda2/lib/python2.7/site-packages/imblearn/version.pyc in _check_module_dependencies(is_imbalanced_dataset_installing)
    100                 module_name=module_name,
    101                 minimum_version=module_metadata['min_version'],
--> 102                 install_info=module_metadata.get('install_info'))

/home/yanghua/anaconda2/lib/python2.7/site-packages/imblearn/version.pyc in _import_module_with_version_check(module_name, minimum_version, install_info)
     75                        module_version=module_version)
     76
---> 77         raise ImportError(message)

ImportError: A sklearn version of at least 0.19.0 is required to use imbalanced-learn. 0.18.1 was found. Please upgrade sklearn
credit_cards = pd.read_csv('creditcard.csv')
columns = credit_cards.columns
# The labels are in the last column ('Class'). Simply remove it to obtain features columns
features_columns = columns.delete(len(columns) - 1)
features = credit_cards[features_columns]
labels = credit_cards['Class']
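The step that actually produces the `os_features`, `os_labels`, `features_test` and `labels_test` used below never made it into the post (the imblearn import failed above). A minimal sketch of what it would look like, assuming imblearn installs cleanly; `fit_sample` is the old imblearn spelling (renamed `fit_resample` in later releases), and the 80/20 split ratio is my assumption:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

# Hold out a test set first and oversample only the training portion,
# so no synthetic sample ever leaks into the evaluation data.
features_train, features_test, labels_train, labels_test = train_test_split(
    features, labels, test_size=0.2, random_state=0)

oversampler = SMOTE(random_state=0)
os_features, os_labels = oversampler.fit_sample(features_train, labels_train)

# SMOTE returns arrays; wrap them so the code below can call .values on them.
os_features = pd.DataFrame(os_features)
os_labels = pd.DataFrame(os_labels)
```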
lr = LogisticRegression(C=best_c, penalty='l1')
lr.fit(os_features, os_labels.values.ravel())
y_pred = lr.predict(features_test.values)

# Compute confusion matrix
cnf_matrix = confusion_matrix(labels_test, y_pred)
np.set_printoptions(precision=2)
print("Recall metric in the testing dataset: ",
      float(cnf_matrix[1, 1]) / (cnf_matrix[1, 0] + cnf_matrix[1, 1]))

# Plot non-normalized confusion matrix
class_names = [0, 1]
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names, title='Confusion matrix')
plt.show()
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-10-01679eaeabd5> in <module>()
----> 1 lr = LogisticRegression(C = best_c, penalty = 'l1')
      2 lr.fit(os_features,os_labels.values.ravel())
      3 y_pred = lr.predict(features_test.values)
      4
      5 # Compute confusion matrix

NameError: name 'LogisticRegression' is not defined

Summary

@Workflow: data preparation
1. Define the goal: this is a classification problem.
2. Load the data and look at how the samples are distributed; since this is classification, a bar chart makes the split visible. The classes turn out to be badly imbalanced, which is usually handled in one of two ways:
   undersampling: randomly pick as many majority-class records as there are minority-class records
   oversampling: generate new minority-class samples, e.g. with the SMOTE algorithm
3. Inspect the value ranges: Amount spans a much wider range than the other features, so apply feature scaling to bring every feature into a comparable range.
4. Split the data into a training part and a validation part; sklearn handles this step automatically.

---------- The above processes the data itself; a feature-extraction step is still missing ----------

@Workflow: setting up model training
1. Cross-validation: it lets us reuse a limited dataset repeatedly in search of the best model. Here we run five-fold cross-validation and print the result of every fold.
2. Regularization: the standard defense against over-fitting. The regularization parameter has a large effect on the model, so we cross-validate each candidate parameter in turn, which makes the score assigned to each one far more trustworthy.
(2a. LogisticRegression also has a threshold of sorts: the sigmoid cut-off defaults to 0.5 but can be set by hand, as discussed above.)
3. Run the training function on the data.

---------- The above trains the model and yields the best regularization parameter ----------

@Workflow: evaluating the trained model
1. There are two ways to judge the model: precision and recall. The two must be weighed together, or traded off according to the actual requirements of the project. The confusion matrix is the tool that lets us read both off at once.

---------- Training is complete and the parameters are in hand; one can still retrain with other sampling schemes ----------

Reflections

This is the first case study I have worked through, and the big picture of machine learning it leaves me with is: prepare the data, train the parameters, validate. Personally, though, I think the crux lies in the quality of the data preprocessing, and that you must be able to derive the core idea of each algorithm with complete clarity.