Account Sharing Classification Model
Author:Zhang Felix
Introduction
In the internet age, online shopping has revolutionized the consumer experience, growing rapidly year over year and ushering in a golden age of shopping. Account sharing is a common phenomenon in online shopping, and people share accounts for many purposes: sellers share their accounts with employees to sell more efficiently, while buyers share their accounts with family, friends, and roommates for a more convenient experience. In this paper, we present a logic-based study that defines account sharing behavior in the eBay marketplace. Using data from a large eBay behavioral data platform, we show that there are discernible and noteworthy patterns of account sharing. We also build an account sharing classification model that captures shared accounts using user demographic, pre-switch, post-switch, switch transition, and meta-session-level features. Our findings show that the model attains strong account sharing detection accuracy, improves the personalization experience on eBay, and enables analytics on device switching.
RELATED WORK
In this paper our focus is on identifying account-sharing users. Related work falls into the following areas: (1) connecting sessions across device switches; (2) account sharing behavior classification.
Multi-screen usage is key to the account sharing model. The concept of multi-screen usage was raised by Google two years ago. Device switching has been studied intensively, and behavioral data from log servers have proven extremely valuable for studying how people switch from one device to another. In the eBay marketplace there are three device categories: desktop, cellphone, and tablet. We connect switching sessions from one device category to another, and we call these connected sessions “meta sessions” in this project. There are three types of meta session.
First is sequential: session 2 starts within 30 minutes after session 1.
Second is overlapping: session 2 partially overlaps session 1 in time.
Third is subsuming: session 1 entirely contains session 2.
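The three meta-session types above can be sketched as a small classifier over session time ranges. This is an illustrative reconstruction, not the project's actual code; we assume session 1 starts first and that the 30-minute rule for "sequential" measures the gap from session 1's end to session 2's start.

```python
from datetime import datetime, timedelta

SEQUENTIAL_GAP = timedelta(minutes=30)  # max gap for a "sequential" switch (assumed from end of session 1)

def meta_session_type(s1_start, s1_end, s2_start, s2_end):
    """Classify a pair of device sessions (session 1 starting first)
    into one of the three meta-session types, or None if unrelated."""
    if s2_start >= s1_end:
        # Session 2 begins after session 1 ends: sequential if within 30 minutes.
        return "sequential" if s2_start - s1_end <= SEQUENTIAL_GAP else None
    if s2_end <= s1_end:
        # Session 2 lies entirely inside session 1.
        return "subsuming"
    # Session 2 starts before session 1 ends but runs past it.
    return "overlapping"
```

For example, a desktop session from 10:00 to 10:40 followed by a cellphone session from 10:50 to 11:10 would be classified as sequential.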
Modeling
To build the account sharing classification model, we created five groups of features.
Feature Group 1 - User Demographic Features
Specification: features describing basic attributes of the eBay user.
- Buyer/Seller Indicator
- Gender
- Age Group
- Number of Children
- Single/Family Indicator
- Seller Level (CSS)
Feature Group 2 - Pre-Switch Features
Specification: features of the pre-switch session.
- Device category
- Avg leaf male fraction
- Avg leaf female fraction
- Session Duration
Feature Group 3 - Post-Switch Features
Specification: features of the post-switch session.
- Device category
- Avg leaf male fraction
- Avg leaf female fraction
- Session duration
Feature Group 4 - Switch Transition Features
Specification: features of the transition between the pre-switch and post-switch sessions.
- Distance
- Moving Speed
- Sequential gap duration
- Overlap gap duration
- Event Gap Count for 1 & 2 second threshold
- Switch hour bucket
- Device pair frequent count
- Leaf gender variance score
- Meta Category similarity Score
- Notification pair indicator
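The Distance and Moving Speed features above could be derived from the geo-coordinates of the pre- and post-switch sessions. The article does not show this computation, so the sketch below is an assumption: we use the standard haversine great-circle formula and illustrative function names. An implausibly high implied speed between two "sequential" sessions suggests that two different people were active on the account.

```python
import math

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometers."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def moving_speed_kmh(dist_km, gap_seconds):
    """Implied travel speed across the switch; infinite for overlapping sessions."""
    return dist_km / (gap_seconds / 3600.0) if gap_seconds > 0 else float("inf")
```

For instance, two sessions an hour apart located in San Francisco and Los Angeles imply a travel speed of roughly 560 km/h, which no ground transport achieves.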
Feature Group 5 - Meta Session Features
Specification: features of the account's overall daily activity within the meta session.
- # unique devices
- # unique cellphones
- # unique desktops
- # unique tablets
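The Feature Group 5 counts can be computed directly from a day's sessions. The sketch below is illustrative (the session record layout is assumed), but the output keys reuse the feature names that appear in the variable importance chart later in this article (ttl_dev_cnt, mob_dev_cnt, pc_dev_cnt, tab_dev_cnt).

```python
def meta_session_features(sessions):
    """Compute the Feature Group 5 device counts from one day's sessions.
    Each session is assumed to be a dict with a 'device_id' and a
    'device_category' in {'desktop', 'cellphone', 'tablet'}."""
    by_category = {}
    all_devices = set()
    for s in sessions:
        all_devices.add(s["device_id"])
        by_category.setdefault(s["device_category"], set()).add(s["device_id"])
    return {
        "ttl_dev_cnt": len(all_devices),
        "mob_dev_cnt": len(by_category.get("cellphone", set())),
        "pc_dev_cnt": len(by_category.get("desktop", set())),
        "tab_dev_cnt": len(by_category.get("tablet", set())),
    }
```

An account that touches two distinct cellphones and a desktop in one day, for example, yields mob_dev_cnt = 2 and pc_dev_cnt = 1.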
We added the feature groups step by step to train our GBM model; the models compared in the Specificity vs. Sensitivity plot are:
M1: user demographic features + pre-switch session features
M2: M1 + post-switch session features
M3: M2 + switch transition features
M4: M3 + meta-session-level features
M5: an ensemble of 8 M4 models
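The M1–M4 progression can be sketched as a loop over cumulative feature-group slices. The article's data and column lists are not available, so the sketch below uses synthetic data with scikit-learn's GradientBoostingClassifier standing in for the GBM; the signal is planted in the later columns to mimic the reported jump in ROC once transition and meta-session features are added.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1200, 20))  # synthetic stand-in: 4 feature groups of 4-8 cols
# Plant the signal in the "transition" and "meta session" columns.
y = (X[:, 12] + X[:, 17] + rng.normal(scale=0.7, size=1200) > 0).astype(int)

cumulative_groups = {
    "M1": slice(0, 8),   # demographics + pre-switch
    "M2": slice(0, 12),  # + post-switch
    "M3": slice(0, 16),  # + switch transition
    "M4": slice(0, 20),  # + meta session
}

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
aucs = {}
for name, cols in cumulative_groups.items():
    gbm = GradientBoostingClassifier(n_estimators=350, max_depth=3,
                                     learning_rate=0.15, random_state=0)
    gbm.fit(X_tr[:, cols], y_tr)
    aucs[name] = roc_auc_score(y_te, gbm.predict_proba(X_te[:, cols])[:, 1])
print(aucs)
```

On the synthetic data, as in the article's ROC table, the models that see the transition and meta-session columns clearly outperform M1 and M2.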
GBM Parameter Tuning
We tuned the model parameters by grid search and obtained the best values: tree = 350, depth = 3, and shrinkage = 0.15.
For all further model tuning, we apply these three parameters to all the models.
# of Trees = 350:
Stochastic Gradient Boosting
67500 samples
24 predictors
2 classes: 'No', 'Yes'
No pre-processing
Resampling: Cross-Validated (3 fold)
Summary of sample sizes: 44999, 45001, 45000
Resampling results across tuning parameters:
n.trees ROC Sens Spec ROC SD Sens SD Spec SD
100 0.958 0.973 0.472 0.00064 0.000983 0.00284
150 0.961 0.971 0.512 0.000583 0.00121 0.0046
200 0.962 0.969 0.55 0.000406 0.000684 0.00861
250 0.963 0.968 0.568 0.000547 5.69e-05 0.0103
300 0.963 0.967 0.582 0.000506 0.000591 0.00743
350 0.964 0.966 0.595 0.000618 0.000402 0.00395
400 0.964 0.966 0.599 0.00063 0.000421 0.00409
450 0.964 0.966 0.599 0.000767 0.000427 0.00686
500 0.964 0.966 0.605 0.000733 0.000733 0.0112
# interaction.depth = 3:
67500 samples
24 predictors
2 classes: 'No', 'Yes'
No pre-processing
Resampling: Cross-Validated (3 fold)
Summary of sample sizes: 45000, 44999, 45001
Resampling results across tuning parameters:
interaction.depth ROC Sens Spec ROC SD Sens SD Spec SD
1 0.943 0.978 0.338 0.00238 0.00233 0.0177
3 0.963 0.965 0.597 0.00136 0.000994 0.00711
5 0.964 0.964 0.628 0.00159 0.00108 0.00653
7 0.964 0.963 0.629 0.00155 0.00158 0.00479
Shrinkage=0.15:
Stochastic Gradient Boosting
67500 samples
24 predictors
2 classes: 'No', 'Yes'
No pre-processing
Resampling: Cross-Validated (3 fold)
Summary of sample sizes: 45000, 44999, 45001
Resampling results across tuning parameters:
shrinkage ROC Sens Spec ROC SD Sens SD Spec SD
0.05 0.961 0.97 0.531 0.000171 0.00197 0.0166
0.1 0.963 0.967 0.588 0.000417 0.00118 0.0107
0.15 0.964 0.965 0.597 4e-04 0.00115 0.00869
0.2 0.964 0.964 0.607 0.000871 0.00155 0.00758
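The cross-validated output above comes from R's caret package. As an illustrative equivalent (not the original code), the same grid search can be expressed with scikit-learn on synthetic data; caret's n.trees, interaction.depth, and shrinkage map to n_estimators, max_depth, and learning_rate.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 24))  # 24 predictors, as in the caret output
y = (X[:, 0] + rng.normal(size=600) > 0).astype(int)

param_grid = {
    "n_estimators": [100, 350],     # caret's n.trees
    "max_depth": [1, 3],            # caret's interaction.depth
    "learning_rate": [0.05, 0.15],  # caret's shrinkage
}
search = GridSearchCV(GradientBoostingClassifier(random_state=0),
                      param_grid, scoring="roc_auc", cv=3)
search.fit(X, y)
print(search.best_params_)
```

`search.best_params_` then plays the role of the (350, 3, 0.15) combination selected above, and `search.cv_results_` holds the per-combination ROC means and standard deviations shown in the tables.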
Ensemble Model
Under Sampling
Because the positive/negative data is imbalanced (the ratio is nearly 1:8), we use under-sampling to partition the data into 8 training sets and build 8 GBM models.
Mixture Model
We blend the 8 models' outputs to generate a predicted positive-label probability,
and set the cutoff at 90% to improve precision.
Mixing with rule engine
To compensate for the recall sacrificed by the high cutoff, we combine the rule engine's labels back in to produce the final labels.
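The under-sampling and blending steps can be sketched as follows, with scikit-learn's GradientBoostingClassifier again standing in for the GBM: the majority (negative) class is split into 8 partitions, each paired with every positive example to train one model, and the 8 probabilities are averaged and thresholded at 0.9. The function names are illustrative, and the rule-engine merge is elided.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def train_undersampled_ensemble(X, y, n_models=8, seed=0):
    """Split the negatives into n_models partitions; train one GBM on
    each partition combined with all positives (under-sampling)."""
    rng = np.random.default_rng(seed)
    pos = np.where(y == 1)[0]
    neg = rng.permutation(np.where(y == 0)[0])  # shuffle, then split 8 ways
    models = []
    for part in np.array_split(neg, n_models):
        idx = np.concatenate([pos, part])
        gbm = GradientBoostingClassifier(n_estimators=350, max_depth=3,
                                         learning_rate=0.15, random_state=seed)
        models.append(gbm.fit(X[idx], y[idx]))
    return models

def blended_labels(models, X, cutoff=0.9):
    """Average the positive-class probabilities across the ensemble,
    then apply the 90% cutoff to favor precision."""
    p = np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)
    return (p >= cutoff).astype(int)
```

Because each of the 8 models sees all positives but only one-eighth of the negatives, every training set is roughly balanced, which is the point of the under-sampling step.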
ROC comparison:
ROC comparison:

Model  Model Description                        ROC
M1     user demographics + pre_switch features  0.6273
M2     M1 + post_switch features                0.6558
M3     M2 + between_switch transition features  0.9381
M4     M3 + meta session level features         0.938
M5     8 M4 Ensemble Model                      0.9696

Variable Importance Chart:

Feature                      Overall
mob_dev_cnt                  100
pc_dev_cnt                   71.4528
seq_gap_dur                  24.3231
notif_pair_as_label          20.7925
tab_dev_cnt                  15.6279
device_pair_cnt              9.9352
overlap_gap_dur              7.6
meta_categ_similarity_score  6.8111
ttl_dev_cnt                  5.7397
sec_gap_pct                  3.5903
sec_gap_cnt                  2.308
to_sess_dur                  1.5664
from_sess_dur                1.1491
leaf_gender_diff_score       1.0783
to_avg_female_pct            1.0188
is_buyer                     0.5936
to_avg_male_pct              0.424
user_age                     0.3801
from_avg_female_pct          0.3113
switch_hour_bucket           0.2983