Account Sharing Classification Model


Author: Zhang Felix

Introduction

In the internet age, online shopping has revolutionized the consumer experience, growing rapidly year over year. Account sharing is a common phenomenon in online shopping, and people share accounts for all kinds of purposes: sellers share their accounts with employees to sell more efficiently, and buyers share their accounts with family, friends, and roommates for a more convenient user experience. In this paper, we present a logic-based study that defines account sharing behavior in the eBay marketplace. Using data from a large eBay behavioral data platform, we show that there are discernible and noteworthy patterns of account sharing. We also build an account sharing classification model that captures shared accounts using user demographic, pre-switch, post-switch, switch-transition, and meta-session-level features. Our findings show that the model attains strong account sharing detection accuracy, improves the personalization experience on eBay, and enables device switch analytics.

RELATED WORK

In this paper our focus is on identifying account-sharing users. Related work falls into the following areas: (1) connecting sessions across device switches; (2) classifying account sharing behavior.

Multi-screen usage is key to the account sharing model. The concept of multi-screen was raised by Google two years ago. Device switching has been studied intensively, and behavioral data from server logs have proven extremely valuable for understanding how people move from one device to another. In the eBay marketplace there are three device categories: desktop, cellphone, and tablet. We connect switching sessions from one device category to another and call the connected sessions a "meta session" in this project. There are three types of meta session.

First is sequential: session 2 starts within 30 minutes after session 1 ends.

Second is overlapping: session 2 partially overlaps with session 1.

Third is subsuming: session 1 fully contains session 2.
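The three meta-session types above can be sketched as a small classifier over session timestamps. This is a minimal Python sketch, assuming each session is represented by its start and end times and that session pairs are ordered by start time; pairs separated by more than the 30-minute window are simply not joined into a meta session:

```python
from datetime import datetime, timedelta

SEQUENTIAL_WINDOW = timedelta(minutes=30)  # per the sequential definition above

def meta_session_type(s1_start, s1_end, s2_start, s2_end):
    """Classify a session pair; session 1 is assumed to start first."""
    assert s1_start <= s2_start
    if s2_start >= s1_end:
        # Session 2 begins after session 1 ends: sequential if within 30 min,
        # otherwise the pair is not joined into a meta session at all.
        return "sequential" if s2_start - s1_end <= SEQUENTIAL_WINDOW else None
    if s2_end <= s1_end:
        return "subsuming"    # session 1 fully contains session 2
    return "overlapping"      # the two sessions partially overlap
```

In a pipeline, this check would run over consecutive session pairs per account per day to stitch sessions into meta sessions.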

Modeling

To build the account sharing classification model, we created five groups of features.

Feature Group 1 - User Demographic Features

Specification: Features describing basic attributes of an eBay user.

  • Buyer/Seller Indicator
  • Gender
  • Age Group
  • Number of Children
  • Single/Family Indicator
  • Seller Level (CSS)

Feature Group 2 - Pre-Switch Features

Specification: Features describing the pre-switch session.

  • Device category
  • Avg leaf male fraction
  • Avg leaf female fraction
  • Session Duration

Feature Group 3: Post-Switch Features

Specification: Features describing the post-switch session.

  • Device category
  • Avg leaf male fraction
  • Avg leaf female fraction
  • Session duration

Feature Group 4 - Switch Transition Features

Specification: Features describing the transition between the pre-switch and post-switch sessions.

  • Distance
  • Moving Speed
  • Sequential gap duration
  • Overlap gap duration
  • Event Gap Count for 1 & 2 second threshold
  • Switch hour bucket
  • Device pair frequent count
  • Leaf gender variance score
  • Meta Category similarity Score
  • Notification pair indicator 
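A few of the transition features above (distance, moving speed, and the two gap durations) can be sketched as follows. This is an illustrative sketch only: the session schema, the use of haversine distance between session geolocations, and all field names are assumptions, not the production definitions:

```python
from datetime import datetime
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))

def transition_features(pre_end, post_start, pre_loc, post_loc):
    """Compute a few switch-transition features for one session pair."""
    gap_s = (post_start - pre_end).total_seconds()
    feats = {
        "seq_gap_dur": max(gap_s, 0.0),       # gap when sessions are sequential
        "overlap_gap_dur": max(-gap_s, 0.0),  # overlap when sessions intersect
        "distance_km": haversine_km(*pre_loc, *post_loc),
    }
    # Implied travel speed of the switch; only meaningful for sequential
    # switches. A physically implausible speed hints at two different people.
    feats["moving_speed_kmh"] = (
        feats["distance_km"] / (gap_s / 3600.0) if gap_s > 0 else None
    )
    return feats
```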

Feature Group 5 - Meta Session Features

Specification: Features describing the account's overall daily activity within the meta session.

  • # unique devices
  • # unique cellphones
  • # unique desktops
  • # unique tablets

We added the feature groups step by step to train our GBM model and compared the resulting models on a specificity vs. sensitivity plot:

M1: user demographic features + pre-switch session features

M2: M1 + post-switch session features

M3: M2 + switch transition features

M4: M3 + meta-session-level features

M5: an ensemble of 8 M4 models
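The incremental M1-M4 setup can be sketched as a cumulative feature-group evaluation. This is a hedged illustration on synthetic data, using scikit-learn's gradient boosting as a stand-in for the GBM used here; the column indices assigned to each group are invented for the sketch:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the real 24-predictor training set.
X, y = make_classification(n_samples=1000, n_features=24, random_state=0)

# Hypothetical column ranges for each cumulative feature set (illustrative).
groups = {
    "M1 demographics + pre-switch": range(0, 10),
    "M2 + post-switch": range(0, 14),
    "M3 + transition": range(0, 20),
    "M4 + meta session": range(0, 24),
}
scores = {}
for name, cols in groups.items():
    model = GradientBoostingClassifier(random_state=0)
    # 3-fold CV scored by ROC AUC, mirroring the evaluation in the text.
    scores[name] = cross_val_score(model, X[:, list(cols)], y,
                                   cv=3, scoring="roc_auc").mean()
```

Comparing the four scores side by side shows how much each added feature group contributes, which is the comparison the specificity-vs-sensitivity plot makes visually.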

 

GBM Parameter Tuning

We tuned the model parameters by grid search and obtained the best values: number of trees = 350, depth = 3, and shrinkage = 0.15. These three parameters are applied to all subsequent models.
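The caret-style output below indicates the tuning was done in R. An analogous grid search can be sketched with scikit-learn, where `n_estimators`, `max_depth`, and `learning_rate` play the roles of `n.trees`, `interaction.depth`, and `shrinkage`; synthetic data and a reduced grid stand in for the real training set and the full search:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the 67,500-sample, 24-predictor training set.
X, y = make_classification(n_samples=800, n_features=24, random_state=0)

grid = {
    "n_estimators": [100, 350],     # caret's n.trees
    "max_depth": [1, 3],            # caret's interaction.depth
    "learning_rate": [0.05, 0.15],  # caret's shrinkage
}
# 3-fold cross-validation scored by ROC AUC, matching the setup below.
search = GridSearchCV(GradientBoostingClassifier(random_state=0),
                      grid, scoring="roc_auc", cv=3)
search.fit(X, y)
```

After fitting, `search.best_params_` holds the winning combination, analogous to caret reporting the final values used for the model.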

# of Trees = 350:

Stochastic Gradient Boosting

67500 samples
24 predictors
2 classes: 'No', 'Yes'

No pre-processing
Resampling: Cross-Validated (3 fold)

Summary of sample sizes: 44999, 45001, 45000

Resampling results across tuning parameters:

n.trees  ROC    Sens   Spec   ROC SD    Sens SD   Spec SD
100      0.958  0.973  0.472  0.00064   0.000983  0.00284
150      0.961  0.971  0.512  0.000583  0.00121   0.0046
200      0.962  0.969  0.55   0.000406  0.000684  0.00861
250      0.963  0.968  0.568  0.000547  5.69e-05  0.0103
300      0.963  0.967  0.582  0.000506  0.000591  0.00743
350      0.964  0.966  0.595  0.000618  0.000402  0.00395
400      0.964  0.966  0.599  0.00063   0.000421  0.00409
450      0.964  0.966  0.599  0.000767  0.000427  0.00686
500      0.964  0.966  0.605  0.000733  0.000733  0.0112

 

interaction.depth = 3:

67500 samples
24 predictors
2 classes: 'No', 'Yes'

No pre-processing
Resampling: Cross-Validated (3 fold)

Summary of sample sizes: 45000, 44999, 45001

Resampling results across tuning parameters:

interaction.depth  ROC    Sens   Spec   ROC SD   Sens SD   Spec SD
1                  0.943  0.978  0.338  0.00238  0.00233   0.0177
3                  0.963  0.965  0.597  0.00136  0.000994  0.00711
5                  0.964  0.964  0.628  0.00159  0.00108   0.00653
7                  0.964  0.963  0.629  0.00155  0.00158   0.00479

 

shrinkage = 0.15:

Stochastic Gradient Boosting

67500 samples
24 predictors
2 classes: 'No', 'Yes'

No pre-processing
Resampling: Cross-Validated (3 fold)

Summary of sample sizes: 45000, 44999, 45001

Resampling results across tuning parameters:

shrinkage  ROC    Sens   Spec   ROC SD    Sens SD  Spec SD
0.05       0.961  0.97   0.531  0.000171  0.00197  0.0166
0.1        0.963  0.967  0.588  0.000417  0.00118  0.0107
0.15       0.964  0.965  0.597  4e-04     0.00115  0.00869
0.2        0.964  0.964  0.607  0.000871  0.00155  0.00758

 

Ensemble Model

Under Sampling

Because the positive/negative data is imbalanced (the ratio is nearly 1:8), we use under-sampling to build 8 balanced training sets and train 8 GBM models.

Mixture Model

We blend the 8 models together to generate a single predicted positive-label probability, and set the classification cutoff at 90% to improve precision.

Mixing with rule engine

To compensate for the recall sacrificed by the high cutoff, we combine the rule engine's labels back in to produce the final labels.
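Taken together, the under-sampling, blending, and rule-mixing steps can be sketched as follows. This is an illustrative sketch on synthetic imbalanced data: scikit-learn's gradient boosting stands in for the GBM, and the rule engine output is a placeholder array:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
# Synthetic imbalanced data, roughly the 1:8 positive/negative ratio above.
X, y = make_classification(n_samples=3000, weights=[8 / 9, 1 / 9], random_state=0)

pos, neg = np.where(y == 1)[0], np.where(y == 0)[0]
models = []
for seed in range(8):
    # Under-sampling: each model sees all positives plus an equal-size
    # random sample of negatives.
    neg_sample = rng.choice(neg, size=len(pos), replace=False)
    idx = np.concatenate([pos, neg_sample])
    m = GradientBoostingClassifier(n_estimators=100, max_depth=3,
                                   learning_rate=0.15, random_state=seed)
    models.append(m.fit(X[idx], y[idx]))

# Blend: average positive-class probabilities across the 8 models.
p = np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)
pred = p >= 0.90                        # high cutoff for precision
rule_label = np.zeros(len(y), bool)     # placeholder for rule engine output
final = pred | rule_label               # OR the rules back in to recover recall
```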

ROC comparison:

Model  Model Description                        ROC
M1     user demographics + pre_switch features  0.6273
M2     M1 + post_switch features                0.6558
M3     M2 + between_switch transition features  0.9381
M4     M3 + meta session level features         0.938
M5     8 M4 Ensemble Model                      0.9696

Variable Importance Chart:

Feature                      Overall
mob_dev_cnt                  100
pc_dev_cnt                   71.4528
seq_gap_dur                  24.3231
notif_pair_as_label          20.7925
tab_dev_cnt                  15.6279
device_pair_cnt              9.9352
overlap_gap_dur              7.6
meta_categ_similarity_score  6.8111
ttl_dev_cnt                  5.7397
sec_gap_pct                  3.5903
sec_gap_cnt                  2.308
to_sess_dur                  1.5664
from_sess_dur                1.1491
leaf_gender_diff_score       1.0783
to_avg_female_pct            1.0188
is_buyer                     0.5936
to_avg_male_pct              0.424
user_age                     0.3801
from_avg_female_pct          0.3113
switch_hour_bucket           0.2983
