Logistic Regression Classification with sklearn
Source: Internet  Editor: 程序博客网  Time: 2024/06/15 21:08
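Data cleaning and feature selection with Pandas + Logistic Regression classification with sklearn
(Notes from a Data Mining assignment)
Data description and analysis
We have a dataset that records students' video-stream events for one course on the academic affairs website, and we use it to predict whether a student fails the course. (Is there really a relationship between the two?)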
user_id: Identifies the individual who is performing the action.
session: This 32-character value is a key that identifies the user’s session. All browser events include a value for the session. Other mobile events do not include a session value.
load_video: This tag appears when the video is rendered and ready to play.
play_video: This tag appears when a user selects the video player’s play control.
pause_video: This tag appears when a user selects the video player’s pause control.
seek_video: This tag appears when a user selects a user interface control to go to a different point in the video file.
stop_video: This tag appears when the video player reaches the end of the video file and play automatically stops.
speed_change_video: This tag appears when a user selects a different playing speed for the video.
event_time: The time that this event occurs. Gives the UTC time at which the event was emitted in ‘YYYY-MM-DDThh:mm:ss.xxxxxx’ format.
new_time: The time in the video, in seconds, that the user selected as the destination point. This field appears for the seek_video action only.
old_time: The time in the video, in seconds, at which the user chose to go to a different point in the file. This field appears for the seek_video action only.
old_speed: The speed at which the video was playing. This field appears for the speed_change_video action only.
new_speed: The speed that the user selected for the video to play. This field appears for the speed_change_video action only.
grade: Final performance status, 0 for not pass and 1 for pass.
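The event_time format described above maps directly onto Python's standard datetime parsing. The following is a minimal sketch, assuming the strings carry no timezone suffix and always include the microsecond part; the sample value and the helper name parse_event_time are made up for illustration and do not come from the dataset.

```python
from datetime import datetime, timezone

# Format string matching 'YYYY-MM-DDThh:mm:ss.xxxxxx'
EVENT_TIME_FORMAT = "%Y-%m-%dT%H:%M:%S.%f"


def parse_event_time(text):
    """Parse an event_time string and tag it as UTC (hypothetical helper)."""
    return datetime.strptime(text, EVENT_TIME_FORMAT).replace(tzinfo=timezone.utc)


sample = "2017-11-02T08:15:30.123456"  # hypothetical value, not from the dataset
parsed = parse_event_time(sample)
print(parsed.isoformat())
```

Parsed timestamps like this would be the starting point for time-based features such as total watching time per session.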
Training environment
OS: Win 10
Python version: 3.6.3
Scikit-learn: 0.19.1
Pandas: 0.21.0
Numpy: 1.13.3
A typical example is run as:
python lr.py
Feature selection
- The number of videos that the student has watched.
- The number of times the student played the videos.
- The number of times the student paused the videos while watching.
- The number of times the student stopped the videos while watching.
- The number of times the student changed the video speed while watching.
- The number of sessions of one student (the number of times the student opened the browser to watch the videos).
PS: Of course these are very simple features; the time fields and other columns in the dataset are not used at all.
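Count features of this kind can be built with a single groupby. The sketch below is one possible way to do it, not the method used in the reference code (which filters each event type and merges the results). It assumes columns named user_id, session and event_type; the toy rows and the helper name count_features are hypothetical.

```python
import pandas as pd

# Hypothetical toy event log: one row per video event
events = pd.DataFrame({
    "user_id": ["a", "a", "a", "b", "b"],
    "session": ["s1", "s1", "s2", "s3", "s3"],
    "event_type": ["play_video", "pause_video", "play_video",
                   "play_video", "seek_video"],
})


def count_features(df):
    """Return one row per user with per-event counts and a session count."""
    # Count events per (user, event type), then pivot event types into columns
    counts = df.groupby(["user_id", "event_type"]).size().unstack(fill_value=0)
    counts.columns = [name + "_times" for name in counts.columns]
    # Number of distinct sessions per user
    sessions = df.groupby("user_id")["session"].nunique().rename("sessionCount")
    return counts.join(sessions).reset_index()


features = count_features(events)
print(features)
```

Pivoting with unstack(fill_value=0) fills in zero for event types a user never triggered, which plays the same role as the fillna(0) step after the outer merges.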
Model selection (LR, of course)
Use the logistic regression model.
Logistic regression is a statistical method for analyzing a dataset in which there are one or more independent variables that determine an outcome. The outcome is measured with a dichotomous variable (in which there are only two possible outcomes).
In logistic regression, the dependent variable is binary or dichotomous, i.e. it only contains data coded as 1 (TRUE, success, pregnant, etc.) or 0 (FALSE, failure, non-pregnant, etc.).
The goal of logistic regression is to find the best fitting (yet biologically reasonable) model to describe the relationship between the dichotomous characteristic of interest (dependent variable = response or outcome variable) and a set of independent (predictor or explanatory) variables. Logistic regression generates the coefficients (and its standard errors and significance levels) of a formula to predict a logit transformation of the probability of presence of the characteristic of interest.
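To make the logit transformation concrete: the model is linear in the log-odds, and the sigmoid function maps that linear score back to a probability. This is a small pure-Python sketch; the function names and the example weights are illustrative assumptions, not values fitted on the dataset.

```python
import math


def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))


def sigmoid(z):
    """Inverse of logit: maps a real-valued score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(weights, intercept, features):
    """P(y = 1 | features) for a linear score passed through the sigmoid."""
    z = intercept + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


# Hypothetical coefficients for two features
print(predict_probability([0.3, -0.1], 0.5, [2.0, 4.0]))
```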
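Binary class L2 penalized logistic regression minimizes the following cost function:

$$\min_{w, c} \; \frac{1}{2} w^T w + C \sum_{i=1}^{n} \log\left(\exp\left(-y_i \left(X_i^T w + c\right)\right) + 1\right)$$

Here the labels are coded as $y_i \in \{-1, +1\}$, $w$ is the weight vector, $c$ is the intercept, and $C$ is the inverse of the regularization strength.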
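As a rough check of what this objective measures, the sketch below evaluates the L2-penalized cost in the form given in the sklearn documentation, 0.5 * w·w + C * Σ log(1 + exp(-y_i (x_i·w + c))), with labels coded as -1/+1. The function name and the toy data are assumptions made for illustration; this is not how sklearn computes it internally.

```python
import math


def l2_logistic_cost(w, c, X, y, C=1.0):
    """0.5 * w.w + C * sum(log(1 + exp(-y_i * (x_i.w + c)))), y_i in {-1, +1}."""
    penalty = 0.5 * sum(wj * wj for wj in w)
    loss = 0.0
    for xi, yi in zip(X, y):
        margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + c)
        # log1p(exp(-margin)) is fine for moderate margins; a production
        # implementation would guard against overflow for very negative ones
        loss += math.log1p(math.exp(-margin))
    return penalty + C * loss


# Hypothetical toy data: two points, one per class
X = [[1.0, 2.0], [-1.0, -2.0]]
y = [1, -1]
baseline = l2_logistic_cost([0.0, 0.0], 0.0, X, y)   # all-zero model
fitted = l2_logistic_cost([1.0, 1.0], 0.0, X, y)     # separates both points
print(baseline, fitted)
```

With all-zero weights every sample contributes log(2), so the baseline equals n * log(2); a weight vector that separates the two points lowers the total even after paying the penalty term.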
Default parameter values of LogisticRegression in sklearn
class sklearn.linear_model.LogisticRegression(penalty='l2', dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None, solver='liblinear', max_iter=100, multi_class='ovr', verbose=0, warm_start=False, n_jobs=1)
We can train with the default parameters directly; of course, the parameters can also be tuned to suit the dataset.
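One common way to tune the parameters is a grid search over the regularization parameter C with cross-validation. The sketch below is an illustration under stated assumptions: the data is synthetic (make_classification), and the candidate grid and 5-fold setting are arbitrary choices, not values from the assignment.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the six count features (assumption for illustration)
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

# Candidate values of C (inverse regularization strength); arbitrary grid
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(
    LogisticRegression(solver="liblinear"),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

Smaller C means stronger regularization, so the grid spans a few orders of magnitude on both sides of the default 1.0.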
Output
0.860396039604
0.866336633663
0.890099009901
0.869306930693
0.869306930693
0.880198019802
0.862376237624
0.870297029703
0.892079207921
0.887128712871
             precision    recall  f1-score   support

        neg       0.93      0.93      0.93       827
        pos       0.69      0.68      0.69       183

avg / total       0.89      0.89      0.89      1010

time spent: 7.203231573104858
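The ten numbers at the top of the output are accuracies (the fraction of correct predictions on each random split), while the report lists precision and recall per class. With 827 negatives out of 1010 test samples, always predicting "neg" would already give about 0.82 accuracy, so the per-class numbers are the more informative ones. The following pure-Python sketch shows how these metrics relate; the labels are hypothetical and the helper name binary_metrics is made up.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical imbalanced example: 8 negatives, 2 positives
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```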
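Plot the P/R curve (the AUC = 0.5 shown in the title is the constant passed to plot_pr in the reference code, not a computed area):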
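The reference code passes a fixed 0.5 to plot_pr instead of computing the area. If the actual area under the precision/recall curve is wanted, it could be computed roughly as follows with sklearn's precision_recall_curve and auc; the labels and scores here are hypothetical stand-ins for y_test and the predicted probabilities.

```python
from sklearn.metrics import auc, precision_recall_curve

# Hypothetical true labels and predicted positive-class probabilities
y_true = [0, 0, 1, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]

precision, recall, thresholds = precision_recall_curve(y_true, scores)
# Trapezoidal area under the curve, with recall on the x axis
pr_auc = auc(recall, precision)
print(pr_auc)
```

The resulting pr_auc could then be passed as the first argument of plot_pr in place of the constant.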
Reference code
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_curve, roc_curve, auc
from sklearn.metrics import classification_report
from matplotlib import pyplot
from matplotlib import pylab
import pandas as pd
import numpy as np
import time

start_time = time.time()
trainDf = pd.read_csv('TrainFeatures.csv')
testDf = pd.read_csv('TestFeatures.csv')
labelDf = pd.read_csv('TrainLabel.csv')


# Draw the P/R curve
def plot_pr(auc_score, precision, recall, label=None):
    pylab.figure(num=None, figsize=(6, 5))
    pylab.xlim([0.0, 1.0])
    pylab.ylim([0.0, 1.0])
    pylab.xlabel('Recall')
    pylab.ylabel('Precision')
    pylab.title('P/R (AUC=%0.2f) / %s' % (auc_score, label))
    pylab.fill_between(recall, precision, alpha=0.5)
    pylab.grid(True, linestyle='-', color='0.75')
    pylab.plot(recall, precision, lw=1)
    pylab.show()


# Do the data cleaning job: turn the raw event log into one row per user
def data_cleaning(df):
    # Feature: number of distinct videos per student (columns 0 and 1)
    video_number = df.iloc[:, 0:2].drop_duplicates().dropna()
    video_number = video_number.groupby(by=['user_id']).size().reset_index(name='watchVideoTimes')

    # Feature: number of distinct sessions per student (columns 0 and 2)
    session_number = df.iloc[:, [0, 2]].drop_duplicates()
    session_number = session_number.groupby(by=['user_id']).size().reset_index(name='sessionCount')

    # Feature: count of each video event type per student (columns 0 and 7)
    video_type_number = df.iloc[:, [0, 7]].dropna()
    video_type_number = video_type_number.groupby(by=['user_id', 'event_type']).size()\
        .reset_index(name='video_type_number')

    # Select the rows for each event type
    play_video_times = video_type_number[video_type_number.event_type == 'play_video'].drop(['event_type'], axis=1)
    pause_video_times = video_type_number[video_type_number.event_type == 'pause_video'].drop(['event_type'], axis=1)
    seek_video_times = video_type_number[video_type_number.event_type == 'seek_video'].drop(['event_type'], axis=1)
    stop_video_times = video_type_number[video_type_number.event_type == 'stop_video'].drop(['event_type'], axis=1)
    speed_change_times = video_type_number[video_type_number.event_type == 'speed_change_video']\
        .drop(['event_type'], axis=1)

    # Rename the count columns
    play_video_times.rename(columns={'video_type_number': 'play_video_times'}, inplace=True)
    pause_video_times.rename(columns={'video_type_number': 'pause_video_times'}, inplace=True)
    seek_video_times.rename(columns={'video_type_number': 'seek_video_times'}, inplace=True)
    stop_video_times.rename(columns={'video_type_number': 'stop_video_times'}, inplace=True)
    speed_change_times.rename(columns={'video_type_number': 'speed_change_times'}, inplace=True)

    # Merge the columns by key = user_id
    feature_df = pd.merge(video_number, session_number, on='user_id', how='outer')
    feature_df = pd.merge(feature_df, play_video_times, on='user_id', how='outer')
    feature_df = pd.merge(feature_df, pause_video_times, on='user_id', how='outer')
    feature_df = pd.merge(feature_df, seek_video_times, on='user_id', how='outer')
    feature_df = pd.merge(feature_df, stop_video_times, on='user_id', how='outer')
    feature_df = pd.merge(feature_df, speed_change_times, on='user_id', how='outer')

    # Replace NaN with 0 (the student never triggered that event)
    feature_df = feature_df.fillna(0)
    return feature_df


trainingFeature = data_cleaning(trainDf)
testingFeature = data_cleaning(testDf)
trainingFeature = pd.merge(trainingFeature, labelDf, on='user_id')
# trainingFeature.to_csv('cleaning_data_training.csv')
# testingFeature.to_csv('cleaning_data_testing.csv')

# Train the model on 10 random 80/20 splits
average = 0
testNum = 10
for i in range(0, testNum):
    # Columns 1..6 are used as features; this slice leaves out column 7
    # (speed_change_times). Column 8 is the label merged in from labelDf.
    X_train, X_test, y_train, y_test = train_test_split(trainingFeature.iloc[:, 1:7],
                                                        trainingFeature.iloc[:, 8],
                                                        test_size=0.2)
    lr = LogisticRegression()
    lr.fit(X_train, y_train)
    y_pred = lr.predict(X_test)
    # Fraction of correct predictions on this split, i.e. accuracy
    p = np.mean(y_pred == y_test)
    print(p)
    average += p

# Precision and recall, using the model and split from the last iteration
answer = lr.predict_proba(X_test)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_test, answer)
report = answer > 0.5
print(classification_report(y_test, report, target_names=['neg', 'pos']))
# The averaged value is the mean accuracy over the splits
print("average accuracy:", average / testNum)
print("time spent:", time.time() - start_time)
# 0.5 is a fixed value shown in the plot title, not a computed AUC
plot_pr(0.5, precision, recall, "pos")

# Predict the testing data
predict = lr.predict(testingFeature.iloc[:, 1:7])
output = pd.DataFrame(predict.T, columns=['grade'])
output.insert(0, 'user_id', testingFeature.iloc[:, 0])
output.to_csv('prediction.csv', index=False)
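The manual loop over ten random 80/20 splits can also be expressed with sklearn's ShuffleSplit and cross_val_score. This is a sketch of an equivalent evaluation rather than a drop-in replacement: the data below is synthetic, and random_state is fixed only so the run is repeatable.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

# Synthetic stand-in for the feature table (assumption for illustration)
X, y = make_classification(n_samples=500, n_features=6, random_state=0)

# Ten independent random splits, each holding out 20% for testing
splitter = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
scores = cross_val_score(LogisticRegression(solver="liblinear"), X, y,
                         cv=splitter, scoring="accuracy")
print(scores)
print("mean accuracy:", scores.mean())
```

Each entry of scores corresponds to one value printed by the loop in the reference code, and their mean corresponds to the averaged figure.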
References
- Sklearn documentation: http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
- Li Hang, 统计学习方法 (Statistical Learning Methods)
- Maximum Likelihood Estimation of Logistic Regression Models: Theory and Implementation: https://czep.net/stat/mlelr.pdf