Deep Learning in Customer Churn Prediction (Part 1) (Improved Balanced Random Forests and Feature Construction)


Deep Learning in Customer Churn Prediction:

Unsupervised Feature Learning on Abstract Company Independent Feature Vectors

 

I. Introduction

 

A. Current Machine Learning Pipeline at Framed

As the figure above shows, the customer churn prediction model currently used at Framed is a random forest, which helps address overfitting.

The problems with this approach:

A random forest's predictive performance is highly dependent on the features it is given; it lacks the capacity to engineer its own features that would better capture the variance present in the data (and ultimately increase prediction accuracy). Framed spends a lot of time generating secondary features that do just that. Deriving and generating meaningful secondary features becomes a struggle when it has to happen for each and every company Framed provides its service to, because each company has its own unique features exhibiting their own variances and dependencies.

 

The paper also notes that the curse of dimensionality raises the importance of feature engineering in practical applications.

This sets the stage for introducing deep-learning-based feature engineering later on.

 

B. Dissolving Feature Engineering through Deep Learning

The paper points out the importance of unsupervised neural networks for extracting features from high-dimensional data.

DNNs (deep neural networks) not only serve to increase the accuracy of the downstream models they feed into, but also avoid the tedium of manually selecting features.

 

It is worth pointing out here that the Restricted Boltzmann machine provided in sklearn is one such tool for unsupervised feature learning ahead of a linear classifier.

The documentation introduces it as follows:

Restricted Boltzmann machines (RBMs) are unsupervised nonlinear feature learners based on a probabilistic model. The features extracted by an RBM or a hierarchy of RBMs often give good results when fed into a linear classifier such as a linear SVM or a perceptron.

 

See the documentation at:

http://scikit-learn.org/stable/modules/neural_networks_unsupervised.html
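Here is a minimal sketch of that documented pattern: a BernoulliRBM learns features without labels and a Pipeline feeds them into a linear classifier. The digits dataset and all hyperparameters below are placeholders, not taken from the paper.

import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import BernoulliRBM
from sklearn.pipeline import Pipeline

# Scale inputs to [0, 1]; BernoulliRBM expects unit-interval visible units.
digits = load_digits()
X = digits.data / 16.0
y = digits.target

# Unsupervised feature learner followed by a linear classifier.
rbm = BernoulliRBM(n_components=64, learning_rate=0.05, n_iter=20, random_state=0)
clf = Pipeline([("rbm", rbm), ("logistic", LogisticRegression(max_iter=1000))])
clf.fit(X, y)
print("training accuracy: %.3f" % clf.score(X, y))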

 

C. Project Aim and Objectives

The project aim is, as mentioned above, feature engineering that improves accuracy while removing the manual feature-selection step; the paper goes on to describe the advantage of DNNs in providing a uniform way to learn features from different companies' data.

The objectives are the concrete steps toward the project aim, described below.

 

a) Generate an Encompassing Data Representation Architecture for Deep Learning Prediction:

The overall requirement for this data representation is that it reduce dimensionality while preserving a highly discriminative representation of the data.

 

b) Implement an Appropriate Deep Learning Architecture for Churn Prediction:

Here, avoiding manual feature engineering is emphasized again.

It is also noted that the model should be approximately "time-invariant".

The deep architecture should employ techniques to generalize well across months, without a lot of variance in prediction accuracy across months.

 

D. Paper Overview

 

II. BACKGROUND RESEARCH AND RELATED WORK

A. Churn Prediction Applications

Customer churn generally falls into two categories:

Two types of churn: incidental and deliberate.

Incidental churn is due to life circumstances: customer relocation to areas where the company does not provide service, or changes in a customer's financial status such that he/she can no longer afford to stay with the company.

Deliberate churn is when a customer decides to move their custom to a competitor; the cause is a comparison of the services on offer.

 

Below are the churn prediction methods commonly used over the last few decades.

1) Support Vector Machines

Soft margins (which mitigate overfitting) and kernel methods are mentioned here.

First, two advantages of SVMs:

only two hyperparameters need to be controlled (the soft-margin penalty C and the kernel hyperparameter);

the solution found is always globally optimal (better than methods whose optimization may get stuck in a local optimum).

SVMs also adapt well to real-world and noisy datasets;

they are stronger than logistic regression but weaker than random forests.

 

Now the disadvantages of SVMs:

results are sensitive to the choice of the two hyperparameters above and depend on the data features;

training is slower than logistic regression and random forests, which is their biggest drawback in a big-data setting: they do not scale well to large datasets.

2) Decision Trees and Random Forests:

They have the following two weaknesses:

a single decision tree performs worse than other classifiers,

and on imbalanced problems (skewed class sample sizes) it always tends to model the "representative majority".

 

This motivates random forests, the ensemble method built on bagging.

Random forests not only perform well (random splits guard against overfitting), but their aggregate nature also makes them well suited to distributed systems.

However, random forests still perform poorly on extremely imbalanced data, and imbalance is precisely an important characteristic of churn prediction datasets (churned customers are essentially outliers).

This inherent flaw led to the development of Improved Balanced Random Forests.

Below is the paper's description of these Improved Balanced Random Forests; the weighted random forests component has a distinct AdaBoost flavor.

 

The proposed algorithm combined two previous attempts at tackling this issue, Balanced Random Forests and Weighted Random Forests. Balanced random forests work by subsampling the dataset while balancing the samples in terms of class distribution for each tree. This is repeated until all trees generated have covered the majority class. Weighted random forests assign weights to each class, such that the majority class has a lower weight than the minority class, in order to penalize misclassification accordingly.

 

The algorithm obtained by combining Balanced Forests and Weighted Forests above consistently outperforms artificial neural networks, decision trees, and SVMs on data from a Chinese bank, but it is not exempt from "rigorous" feature engineering.

 

B. Unsupervised Feature Learning

The goal is a dimensionality-reduced representation.

The simulation of human visual perception can be divided into three distinct architectures.

Generative deep architecture

Used to describe the high-level correlation properties of the observed data for pattern analysis, and consequently describe the joint statistical distributions of the observed data with their associated classes.

(describes the correlation properties of the observed data)

Discriminative deep architecture

Used to directly classify patterns by describing the posterior distributions of classes given the observed data.

(classifies directly from the observed data)

Hybrid deep architecture

Used when the goal is to classify but classification is supported by the outcomes of a generative architecture. Usually these architectures have the highest prediction accuracy.

(classification combined with a generative architecture)

 

From this description one can see that the abstraction of features by a deep network's hidden layers is what allows the characteristics of the data to be recovered with high fidelity.

 

C. Application of Deep Learning in Churn Prediction

Applications of deep learning to churn prediction are currently rare; the published article describing this problem takes its data from a mobile telecommunication network.

That article uses a 4-layer neural network for binary classification of churners vs. non-churners; the reason for introducing a neural network is to avoid the time that feature engineering can consume.

 

In terms of results, the model is essentially "time-invariant" and does not overfit, with prediction accuracy ranging from 73.2% to 77.9%, showing that multi-layer feed-forward models are effective in churn prediction.
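The article's exact architecture, features, and data are not reproduced here; as a rough sketch only, a multi-layer feed-forward classifier of this kind can be written with sklearn's MLPClassifier. The synthetic data, class imbalance, and layer sizes below are placeholders, not the article's.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for per-user monthly feature vectors with a churn label.
X, y = make_classification(n_samples=5000, n_features=48, n_informative=10,
                           weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# A feed-forward network with several hidden layers (sizes are illustrative).
mlp = MLPClassifier(hidden_layer_sizes=(64, 32, 16), activation="relu",
                    max_iter=500, random_state=0)
mlp.fit(X_train, y_train)
print("test accuracy: %.3f" % mlp.score(X_test, y_test))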

 

The article proposes two possible ways to further improve prediction accuracy: adding the calling region as a label variable, and unsupervised pre-processing of the data (Deep Belief Networks, a generative architecture).

 

The article's shortcomings: the model is restricted to mobile telecommunication networks, and feature engineering is not entirely avoided.

 

D. Diving Deeper into Deep Learning Mechanics

Subsections 1) and 2) cover some basics of supervised learning and artificial neurons:

For normalized (0-1) data, the tanh and logistic functions have larger gradients during backpropagation (faster convergence).

The rectified linear activation function, having only two constant derivatives, is faster to compute than tanh and logistic, and its linear form avoids the vanishing-gradient problem in backpropagation.
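As a quick numerical illustration of the point about gradients (not taken from the paper), the three activations and their derivatives can be compared directly:

import numpy as np

x = np.linspace(-3, 3, 7)

logistic = 1.0 / (1.0 + np.exp(-x))
logistic_grad = logistic * (1.0 - logistic)   # peaks at 0.25, shrinks toward the tails

tanh_grad = 1.0 - np.tanh(x) ** 2             # peaks at 1.0, also shrinks toward the tails

relu_grad = (x > 0).astype(float)             # exactly 0 or 1, no shrinking gradient

for name, grad in [("logistic", logistic_grad), ("tanh", tanh_grad), ("relu", relu_grad)]:
    print(name, np.round(grad, 3))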

 

3) Deep Feed-Forward Neural Network (Multilayer Perceptron)

4) Tackling Overfitting in Deep Neural Networks

Some well-worn basics here (things everyone knows).

 

The paper later touches on tools and concepts for processing data in parallel across multiple machines; these are skipped here.

 

Let us first look at the details of the Improved Balanced Random Forests mentioned above.

To combine these two methods (balanced random forests and weighted random forests):

We introduce two interval variables m and d, where m is the middle point of an interval and d is the length of the interval. A distribution variable alpha is randomly generated in [m - d/2, m + d/2], which directly determines the distribution of samples from the different classes for one iteration. The main reason for introducing these variables is to maintain a random distribution of the different classes across iterations, which results in higher noise tolerance. By contrast, balanced random forests draw the same number of samples from both the majority and minority class so that the classes are represented equally in each tree. Thus, the different classes are no longer randomly distributed across iterations, making the method more vulnerable to noise.

 

The algorithm is laid out in the figure below:

after training the ensemble, a simple sort of the scores gives the desired output.

The figure's description of the algorithm is quite clear, and a straightforward implementation can be given with Python and sklearn.

The implementation follows, tested on the iris data. For ease of comparison only the first two iris classes are used, with y_label values 0 and 1 respectively; to create an imbalance, the training data contains class 0 and class 1 in a 40:2 ratio.

Here is the code:

# coding: utf-8
from __future__ import division

import numpy as np
from sklearn import tree


def generate_tree(X_positive, y_positive, X_negative, y_negative, alpha, m_try):
    # X_positive / y_positive hold the majority class D+,
    # X_negative / y_negative hold the minority class D-.
    single_tree = tree.DecisionTreeClassifier(max_features=m_try)
    n = len(y_positive) + len(y_negative)
    # Bootstrap: draw a fraction alpha of the n samples from D+ and (1 - alpha) from D-.
    sample_index_positive = np.random.randint(0, X_positive.shape[0], size=int(n * alpha))
    sample_index_negative = np.random.randint(0, X_negative.shape[0], size=int(n * (1 - alpha)))
    X_require = np.append(X_positive[sample_index_positive], X_negative[sample_index_negative], axis=0)
    y_require = np.append(y_positive[sample_index_positive], y_negative[sample_index_negative], axis=0)
    # Weighted part: positive samples get weight alpha, negative samples weight 1 - alpha.
    w_negative = 1 - alpha
    w_positive = alpha
    weight_array = np.append(np.ones(len(sample_index_positive)) * w_positive,
                             np.ones(len(sample_index_negative)) * w_negative)
    single_tree.fit(X_require, y_require, sample_weight=weight_array)
    return single_tree


def generate_bagging_tree_list(X_positive, y_positive, X_negative, y_negative, n_tree, m, d, m_try):
    # Balanced part: draw alpha uniformly from [m - d/2, m + d/2] for each tree.
    bagging_tree_list = []
    for _ in range(n_tree):
        alpha = np.random.rand() * d - d / 2 + m
        bagging_tree_list.append(generate_tree(X_positive, y_positive, X_negative, y_negative, alpha, m_try))
    return bagging_tree_list


def predict_negative_sort(X_input, X_positive, y_positive, X_negative, y_negative, n_tree, m, d, m_try):
    # Aggregate the ensemble's votes and sort the inputs by how often
    # each was predicted as the minority (label 1) class.
    bagging_tree_list = generate_bagging_tree_list(X_positive, y_positive, X_negative, y_negative, n_tree, m, d, m_try)
    negative_record_list = []
    for single_tree in bagging_tree_list:
        negative_record_list.append(list(single_tree.predict(X_input)))
    score_sum = np.sum(np.asarray(negative_record_list), axis=0)
    return np.argsort(score_sum)


if __name__ == "__main__":
    from sklearn.datasets import load_iris

    iris = load_iris()
    split_index = 52
    X = iris.data[:split_index]
    y = iris.target[:split_index]
    # Training data: 40 samples of class 0 (majority) vs. 2 samples of class 1 (minority).
    X_positive = X[10:50]
    y_positive = y[10:50]
    X_negative = X[50:]
    y_negative = y[50:]
    # Held-out inputs: 10 samples of class 0 followed by 48 samples of class 1.
    X_input = np.append(iris.data[:10], iris.data[split_index:100], axis=0)
    n_tree = 100
    m = 0.5
    d = 1
    m_try = 3
    negative_score_list = predict_negative_sort(X_input, X_positive, y_positive, X_negative, y_negative, n_tree, m, d, m_try)
    print("negative score list :")
    print(negative_score_list)
The output is essentially stable across multiple runs and separates the two classes well; readers are encouraged to try it.

Under the same data setup a single decision tree is basically unstable (misclassifying half the samples in the worst case), while a random forest (100 decision trees) is basically stable but typically gets about 8 out of 50 wrong.
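For reference, here is a minimal sketch of the baseline comparison described above: a plain decision tree and a plain random forest trained on the same imbalanced 40:2 iris split. The exact error counts will vary from run to run.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
# Training set: 40 samples of class 0 and 2 samples of class 1.
X_train = np.append(iris.data[10:50], iris.data[50:52], axis=0)
y_train = np.append(iris.target[10:50], iris.target[50:52], axis=0)
# Test set: 10 held-out samples of class 0 and 48 of class 1.
X_test = np.append(iris.data[:10], iris.data[52:100], axis=0)
y_test = np.append(iris.target[:10], iris.target[52:100], axis=0)

for name, clf in [("decision tree", DecisionTreeClassifier()),
                  ("random forest", RandomForestClassifier(n_estimators=100))]:
    clf.fit(X_train, y_train)
    errors = np.sum(clf.predict(X_test) != y_test)
    print("%s: %d / %d misclassified" % (name, errors, len(y_test)))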

 

Now let us look at the details of the mobile telecommunication network model.

Taking mobile telecommunications as its example, the article points out that there are as many as 100,000 variables related to user behavior, and notes the difficulty of feature engineering.

Some useful details from the article:

there are two traditional approaches to predicting churn in mobile telecommunication networks.

Some use graph processing techniques: they identify customers who have already churned and analyze a graph model to infer their interactions with current customers, with the aim of predicting new churners from these interactions.

Others predict customer churn by analyzing the interactions between the customer and the Customer Relationship Management (CRM) data.

 

For the former, the AP (affinity propagation) algorithm or probabilistic graphical models should have a role to play.

 

Viewing customer acquisition and churn as a process, customers can be divided into 4 states, as shown in the figure below.

Seeing this figure immediately brings Markov chains to mind, and the intuition is correct: churn prediction in mobile telecommunications has also been studied with Markov chains; see

http://www.acta.sapientia.ro/acta-econ/C3/econ3-04.pdf
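As a small illustration of the Markov-chain intuition, one can propagate a cohort of new customers through a 4-state transition matrix and watch the churned share grow; the state names and transition probabilities below are made up for the sketch, not taken from any paper.

import numpy as np

# P[i, j] = probability of moving from state i to state j in one month.
# States (illustrative): new, active, at-risk, churned.
P = np.array([
    [0.0, 0.80, 0.15, 0.05],   # new     -> mostly becomes active
    [0.0, 0.85, 0.10, 0.05],   # active  -> mostly stays active
    [0.0, 0.30, 0.40, 0.30],   # at-risk -> may recover or churn
    [0.0, 0.00, 0.00, 1.00],   # churned -> absorbing state
])

# Start with a cohort of newly acquired customers and propagate it forward.
state = np.array([1.0, 0.0, 0.0, 0.0])
for month in range(1, 13):
    state = state @ P
    print("month %2d: churned share = %.3f" % (month, state[3]))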

 

Below is an introduction to how the indicators in the article are constructed;

see the original article for details.


Some of the information contained in a CDR (Call Detail Record):

• Id Cell Start: location of the originating cell tower (can be read as a regional factor)
• Number A: the calling phone number (unique identifier of the caller)
• Id Cell End: location of the receiving cell tower (regional factor)
• Number B: the called phone number (unique identifier of the callee)
• Timestamp: start time of the call (analogous to a transaction time)
• Duration: call duration in seconds (analogous to the duration of a transaction)
• IMEI: unique identifier of the phone handset
• Type: whether the call was placed or received


Some information a balance replenishment (top-up) event should contain for the telecom problem.
A balance replenishment event here can be read as the customer actively performing a related operation (so as to maintain an existing, active state):


• Number: Phone number related to the balance replenishment event.
• Timestamp: Time-stamp of the balance replenishment event.
• Amount: Amount of money the customer spent in the balance replenishment event.


Here one input vector is defined per user per month,
containing the call events and the balance replenishment history.

Putting all users into the model without distinguishing categories is surely inadvisable at first;
the first thing to try should be modeling the data of one particular user group.
The article selects the so-called top-3 users.
Top-k algorithms are widely used in search engines, e.g. to find the k most popular search keywords;
users can be grouped together by some shared characteristic.

The article constructs a 48-dimensional vector X by splitting call durations (30 minutes in total length) into intervals and summing, within each interval, the user's call time over one month.
As the article notes, this is similar to a Bag-of-Words representation (commonly used in text mining).
A feature for the amount of cash spent that month is added,
and whether the user is active in the following month is included as a binary value.

Once this input information is given, the way to construct the DNNs is basically clear.
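Here is a minimal sketch of this style of monthly feature construction, under the assumption (one reading of the description above) that the 48 dimensions are call-duration buckets covering 0 to 30 minutes. The CDR field names follow the list above, and the helper monthly_feature_vector is hypothetical, not from the article.

import numpy as np

N_BUCKETS = 48
MAX_DURATION = 30 * 60                 # 30 minutes, in seconds
BUCKET_WIDTH = MAX_DURATION / N_BUCKETS

def monthly_feature_vector(cdr_records, replenishment_events):
    """Build a (48 + 1)-dimensional vector for one user and one month."""
    x = np.zeros(N_BUCKETS + 1)
    for cdr in cdr_records:
        duration = min(cdr["Duration"], MAX_DURATION - 1)
        bucket = int(duration // BUCKET_WIDTH)
        x[bucket] += cdr["Duration"]   # total seconds of calls falling in this bucket
    # Last component: cash spent this month across balance replenishment events.
    x[N_BUCKETS] = sum(e["Amount"] for e in replenishment_events)
    return x

# Toy usage: two calls and one top-up in the month.
cdrs = [{"Number A": "555-0001", "Duration": 95, "Type": "outgoing"},
        {"Number A": "555-0001", "Duration": 610, "Type": "incoming"}]
topups = [{"Number": "555-0001", "Amount": 10.0}]
print(monthly_feature_vector(cdrs, topups))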


So, from the discussion above, one can try to tackle the churn prediction problem starting from Improved Balanced Random Forests, deep multi-layer neural networks (DNNs), and RNNs.


For an introduction to modeling this class of problems with RNNs, see

Deep Learning in Customer Churn Prediction (Part 2) (the WTTE model, what time to the next event).


For sample code and a brief discussion of the feature construction and model experiments above, see

Deep Learning in Customer Churn Prediction (Part 3) (preliminary feature construction practice and basic model experiments).

To be continued.

