Logistic Regression Classification with sklearn


Data cleaning and feature selection with Pandas + Logistic Regression classification with sklearn

(Notes from a Data Mining assignment)

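Data Description and Analysis

We have a dataset that records students' video-stream events as they watch the videos for a course on the academic affairs website, and we use it to predict whether each student fails the course. (Is there really a connection between the two...?)
Dataset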

user_id: Identifies the individual who is performing the action.

session: This 32-character value is a key that identifies the user’s session. All browser events include a value for the session. Other mobile events do not include a session value.

load_video: This tag appears when the video is rendered and ready to play.

play_video: This tag appears when a user selects the video player’s play control.

pause_video: This tag appears when a user selects the video player’s pause control.

seek_video: This tag appears when a user selects a user interface control to go to a different point in the video file.

stop_video: This tag appears when the video player reaches the end of the video file and play automatically stops.

speed_change_video: This tag appears when a user selects a different playing speed for the video.

event_time: The time that this event occurs. Gives the UTC time at which the event was emitted in ‘YYYY-MM-DDThh:mm:ss.xxxxxx’ format.

new_time: The time in the video, in seconds, that the user selected as the destination point. This field appears for the seek_video action only.

old_time: The time in the video, in seconds, at which the user chose to go to a different point in the file. This field appears for the seek_video action only.

old_speed: The speed at which the video was playing. This field appears for the speed_change_video action only.

new_speed: The speed that the user selected for the video to play. This field appears for the speed_change_video action only.

grade: Final performance status: 0 for fail, 1 for pass.
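
To make the time-related fields concrete, here is a minimal sketch of how they might be parsed with Pandas. The two rows are made up for illustration, and the derived column name seek_offset is hypothetical; only the field names come from the description above.

```python
import pandas as pd

# Two made-up events shaped like the fields described above.
events = pd.DataFrame({
    "user_id": [1, 1],
    "event_type": ["play_video", "seek_video"],
    "event_time": ["2017-10-01T08:00:00.000000", "2017-10-01T08:05:30.500000"],
    "old_time": [float("nan"), 120.0],   # only set for seek_video
    "new_time": [float("nan"), 45.0],    # only set for seek_video
})

# event_time is a UTC timestamp string, so parse it as timezone-aware.
events["event_time"] = pd.to_datetime(events["event_time"], utc=True)

# Wall-clock time between the two events.
elapsed = events["event_time"].iloc[1] - events["event_time"].iloc[0]

# Negative offset = the student jumped backwards in the video (hypothetical feature).
events["seek_offset"] = events["new_time"] - events["old_time"]

print(elapsed.total_seconds())
print(events[["event_type", "seek_offset"]])
```

Rows for actions other than seek_video keep NaN in seek_offset, which matches the note that old_time and new_time appear for seek_video only.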

Training Environment

OS: Win 10
Python version: 3.6.3
Scikit-learn: 0.19.1
Pandas: 0.21.0
Numpy: 1.13.3
A typical example is run as:

python lr.py

Feature Selection

  1. The number of videos the student has watched.
  2. The number of times the student played a video.
  3. The number of times the student paused a video while watching.
  4. The number of times a video the student was watching reached its end (stop_video).
  5. The number of times the student changed the playback speed while watching.
  6. The number of sessions for the student (how many times the student opened the browser to watch videos).

PS: These are of course very simple features; the time fields in the dataset are not used at all.
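
As a rough illustration of how count features like these can be derived, the sketch below builds them from a toy event log. The data is invented, and video_id is a hypothetical name for the video identifier column; this is not the author's exact cleaning code, which appears in the reference code section.

```python
import pandas as pd

# Toy event log; "video_id" is a hypothetical column name.
events = pd.DataFrame({
    "user_id":    [1, 1, 1, 1, 2, 2, 2],
    "video_id":   ["a", "a", "b", "b", "a", "a", "a"],
    "session":    ["s1", "s1", "s2", "s2", "s3", "s3", "s4"],
    "event_type": ["load_video", "play_video", "play_video", "pause_video",
                   "play_video", "stop_video", "speed_change_video"],
})

# Features 1 and 6: distinct videos and distinct sessions per student.
videos_watched = events.groupby("user_id")["video_id"].nunique().rename("videos_watched")
session_count = events.groupby("user_id")["session"].nunique().rename("session_count")

# Features 2-5: one count column per event type; missing combinations become 0.
event_counts = pd.crosstab(events["user_id"], events["event_type"])

# One row per student, aligned on user_id.
features = pd.concat([videos_watched, session_count, event_counts], axis=1).fillna(0)
print(features)
```

pd.crosstab fills in zeros for event types a student never triggered, which plays the same role as the fillna(0) step after the outer merges in the reference code.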

Model Selection (LR, of course)

Use the logistic regression model.

Logistic regression is a statistical method for analyzing a dataset in which there are one or more independent variables that determine an outcome. The outcome is measured with a dichotomous variable (in which there are only two possible outcomes).
In logistic regression, the dependent variable is binary or dichotomous, i.e. it only contains data coded as 1 (TRUE, success, pregnant, etc.) or 0 (FALSE, failure, non-pregnant, etc.).
The goal of logistic regression is to find the best fitting (yet biologically reasonable) model to describe the relationship between the dichotomous characteristic of interest (dependent variable = response or outcome variable) and a set of independent (predictor or explanatory) variables. Logistic regression generates the coefficients (and their standard errors and significance levels) of a formula to predict a logit transformation of the probability of presence of the characteristic of interest.
Binary class L2 penalized logistic regression minimizes the following cost function:
[Image: cost function]
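
For reference, the scikit-learn documentation (reference 1) writes this cost function, with labels $y_i \in \{-1, +1\}$, as:

$$\min_{w, c} \; \frac{1}{2} w^T w + C \sum_{i=1}^{n} \log\left(\exp\left(-y_i (X_i^T w + c)\right) + 1\right)$$

A minimal NumPy sketch that evaluates this expression for given parameters follows. The function name l2_logistic_cost and the toy data are assumptions made for illustration; this is not how sklearn computes it internally.

```python
import numpy as np

def l2_logistic_cost(w, c, X, y, C=1.0):
    """Evaluate 0.5 * w.w + C * sum(log(1 + exp(-y_i * (X_i.w + c)))).

    y must contain labels in {-1, +1}.
    """
    margins = y * (X @ w + c)
    # logaddexp(0, -m) is a numerically stable log(1 + exp(-m)).
    return 0.5 * (w @ w) + C * np.sum(np.logaddexp(0.0, -margins))

# Toy data: three samples, two features.
X = np.array([[1.0, 2.0], [0.5, -1.0], [-2.0, 0.3]])
y = np.array([1.0, -1.0, -1.0])

# With all-zero parameters every sample contributes log(2).
cost_at_zero = l2_logistic_cost(np.zeros(2), 0.0, X, y, C=1.0)
print(cost_at_zero)
```

A smaller C makes the penalty term weigh more relative to the data term, so smaller C means stronger regularization.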

Default Parameter Values of LogisticRegression in sklearn

class sklearn.linear_model.LogisticRegression(penalty='l2', dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None, solver='liblinear', max_iter=100, multi_class='ovr', verbose=0, warm_start=False, n_jobs=1)

When training we can use the default parameters directly; we can of course also set them sensibly for the dataset (tuning theta).
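
One possible way to tune a parameter instead of relying on the default is a small grid search over C. The sketch below uses synthetic data from make_classification as a stand-in for the real features, and the candidate values for C are arbitrary assumptions, not values the author used.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the six student features.
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

# Candidate regularization strengths (arbitrary illustrative grid).
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(
    LogisticRegression(solver="liblinear"),
    param_grid,
    cv=5,                 # 5-fold cross-validation for each candidate
    scoring="accuracy",
)
search.fit(X, y)

print(search.best_params_, search.best_score_)
```

After fitting, search.best_estimator_ is a model refit on all of the data with the selected C.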

Output

0.860396039604
0.866336633663
0.890099009901
0.869306930693
0.869306930693
0.880198019802
0.862376237624
0.870297029703
0.892079207921
0.887128712871

             precision    recall  f1-score   support

        neg       0.93      0.93      0.93       827
        pos       0.69      0.68      0.69       183

avg / total       0.89      0.89      0.89      1010

time spent: 7.203231573104858
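
The support column shows a class imbalance (827 neg against 183 pos in the reported split), so it helps to put the accuracies in context. The short sketch below uses only the numbers printed above to compute the mean of the ten accuracies and the accuracy of a hypothetical baseline that always predicts neg.

```python
# The ten accuracies printed above, one per random split.
accuracies = [
    0.860396039604, 0.866336633663, 0.890099009901, 0.869306930693,
    0.869306930693, 0.880198019802, 0.862376237624, 0.870297029703,
    0.892079207921, 0.887128712871,
]
mean_accuracy = sum(accuracies) / len(accuracies)

# Support values from the classification report (last split only).
neg_support, pos_support = 827, 183
baseline_accuracy = neg_support / (neg_support + pos_support)

print("mean accuracy over 10 splits: %.4f" % mean_accuracy)
print("always-neg baseline accuracy: %.4f" % baseline_accuracy)
```

The mean accuracy is about 0.87 while the always-neg baseline is about 0.82 on that split, so the per-class precision and recall for pos (0.69 and 0.68) say more about the model than accuracy alone.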

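The P/R curve is plotted below. The AUC = 0.5 in the title is the constant passed to plot_pr in the reference code, not a value computed from the curve:
[Figure: P/R curve]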
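
The reference code passes a fixed 0.5 as the AUC shown in the plot title. A minimal sketch of how the area under the P/R curve could be computed instead, using sklearn.metrics.auc and average_precision_score, is given below. The labels and scores are made up for illustration.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, auc, average_precision_score

# Made-up true labels and predicted probabilities for the positive class.
y_true = np.array([0, 0, 1, 1, 0, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])

precision, recall, thresholds = precision_recall_curve(y_true, scores)

# Trapezoidal area under the curve, with recall on the x axis.
pr_auc = auc(recall, precision)

# Step-wise summary of the same curve (no interpolation between points).
ap = average_precision_score(y_true, scores)

print("PR AUC: %.3f, average precision: %.3f" % (pr_auc, ap))
```

In the reference code, a value computed like pr_auc from y_test and answer could be passed to plot_pr in place of the literal 0.5.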

Reference Code

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_curve, roc_curve, auc
from sklearn.metrics import classification_report
from matplotlib import pyplot
from matplotlib import pylab
import pandas as pd
import numpy as np
import time

start_time = time.time()
trainDf = pd.read_csv('TrainFeatures.csv')
testDf = pd.read_csv('TestFeatures.csv')
labelDf = pd.read_csv('TrainLabel.csv')


# Draw the P/R curve
def plot_pr(auc_score, precision, recall, label=None):
    pylab.figure(num=None, figsize=(6, 5))
    pylab.xlim([0.0, 1.0])
    pylab.ylim([0.0, 1.0])
    pylab.xlabel('Recall')
    pylab.ylabel('Precision')
    pylab.title('P/R (AUC=%0.2f) / %s' % (auc_score, label))
    pylab.fill_between(recall, precision, alpha=0.5)
    pylab.grid(True, linestyle='-', color='0.75')
    pylab.plot(recall, precision, lw=1)
    pylab.show()


# Do the data cleaning job: turn the raw event log into one feature row per user
def data_cleaning(df):
    # Feature: number of distinct videos for one student
    video_number = df.iloc[:, 0:2].drop_duplicates().dropna()
    video_number = video_number.groupby(by=['user_id']).size().reset_index(name='watchVideoTimes')
    # Feature: number of distinct sessions
    session_number = df.iloc[:, [0, 2]].drop_duplicates()
    session_number = session_number.groupby(by=['user_id']).size().reset_index(name='sessionCount')
    # Feature: count of each video event type per student
    video_type_number = df.iloc[:, [0, 7]].dropna()
    video_type_number = video_type_number.groupby(by=['user_id', 'event_type']).size()\
        .reset_index(name='video_type_number')
    # Select the rows for each event_type
    play_video_times = video_type_number[video_type_number.event_type == 'play_video'].drop(['event_type'], axis=1)
    pause_video_times = video_type_number[video_type_number.event_type == 'pause_video'].drop(['event_type'], axis=1)
    seek_video_times = video_type_number[video_type_number.event_type == 'seek_video'].drop(['event_type'], axis=1)
    stop_video_times = video_type_number[video_type_number.event_type == 'stop_video'].drop(['event_type'], axis=1)
    speed_change_times = video_type_number[video_type_number.event_type == 'speed_change_video']\
        .drop(['event_type'], axis=1)
    # Rename columns
    play_video_times.rename(columns={'video_type_number': 'play_video_times'}, inplace=True)
    pause_video_times.rename(columns={'video_type_number': 'pause_video_times'}, inplace=True)
    seek_video_times.rename(columns={'video_type_number': 'seek_video_times'}, inplace=True)
    stop_video_times.rename(columns={'video_type_number': 'stop_video_times'}, inplace=True)
    speed_change_times.rename(columns={'video_type_number': 'speed_change_times'}, inplace=True)
    # Merge the columns by key = user_id
    feature_df = pd.merge(video_number, session_number, on='user_id', how='outer')
    feature_df = pd.merge(feature_df, play_video_times, on='user_id', how='outer')
    feature_df = pd.merge(feature_df, pause_video_times, on='user_id', how='outer')
    feature_df = pd.merge(feature_df, seek_video_times, on='user_id', how='outer')
    feature_df = pd.merge(feature_df, stop_video_times, on='user_id', how='outer')
    feature_df = pd.merge(feature_df, speed_change_times, on='user_id', how='outer')
    # Replace NaN with 0
    feature_df = feature_df.fillna(0)
    return feature_df


trainingFeature = data_cleaning(trainDf)
testingFeature = data_cleaning(testDf)
trainingFeature = pd.merge(trainingFeature, labelDf, on='user_id')
# trainingFeature.to_csv('cleaning_data_training.csv')
# testingFeature.to_csv('cleaning_data_testing.csv')

# Train the model on 10 random 80/20 splits
average = 0
testNum = 10
for i in range(0, testNum):
    # Columns 1..6 are the features (the slice end is exclusive, so column 7,
    # speed_change_times, is left out); column 8 is taken as the label.
    X_train, X_test, y_train, y_test = train_test_split(trainingFeature.iloc[:, 1:7], trainingFeature.iloc[:, 8],
                                                        test_size=0.2)
    lr = LogisticRegression()
    lr.fit(X_train, y_train)
    y_pred = lr.predict(X_test)
    # Fraction of correct predictions on this split, i.e. accuracy
    p = np.mean(y_pred == y_test)
    print(p)
    average += p

# Precision and recall, computed on the last split with the last fitted model
answer = lr.predict_proba(X_test)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_test, answer)
report = answer > 0.5
print(classification_report(y_test, report, target_names=['neg', 'pos']))
print("average accuracy:", average / testNum)
print("time spent:", time.time() - start_time)
# The 0.5 here is a fixed value shown in the plot title, not a computed AUC
plot_pr(0.5, precision, recall, "pos")

# Predict the testing data
predict = lr.predict(testingFeature.iloc[:, 1:7])
output = pd.DataFrame(predict.T, columns=['grade'])
output.insert(0, 'user_id', testingFeature.iloc[:, 0])
output.to_csv('prediction.csv', index=False)
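
The manual loop over ten random 80/20 splits can also be expressed with ShuffleSplit and cross_val_score. The sketch below is one possible alternative, not the author's code. It runs on synthetic, imbalanced data from make_classification because the CSV files are not available here.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

# Synthetic stand-in: six features, roughly 80% negative and 20% positive.
X, y = make_classification(n_samples=500, n_features=6,
                           weights=[0.8, 0.2], random_state=0)

# Ten independent random splits with 20% held out, as in the loop above.
cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)

scores = cross_val_score(LogisticRegression(solver="liblinear"), X, y,
                         cv=cv, scoring="accuracy")

print(scores)
print("average accuracy:", scores.mean())
```

Fixing random_state makes the splits reproducible, which the original loop does not do. That is why its printed accuracies vary from run to run.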

References

  1. Sklearn documentation, Logistic regression. http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  2. Li Hang, Statistical Learning Methods (李航, 统计学习方法)
  3. Maximum Likelihood Estimation of Logistic Regression Models: Theory and Implementation. https://czep.net/stat/mlelr.pdf