Wide & Deep paper notes, 2016.6.24


Abstract

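Generalized linear models with nonlinear feature transformations are widely used for large-scale regression and classification problems with sparse inputs. Memorization of feature interactions through a wide set of cross-product feature transformations is effective and interpretable, whereas generalization requires more feature engineering effort.
With less feature engineering, deep neural networks can generalize better to unseen feature combinations through low-dimensional dense embeddings learned for the sparse features. However, deep neural networks with embeddings can over-generalize and recommend less relevant items when the user-item interactions are sparse and high-rank.
In this paper, we present Wide & Deep learning, which jointly trains wide linear models and deep neural networks to combine the benefits of memorization and generalization. We productionized and evaluated the system on Google Play, a commercial mobile app store with over one billion active users and over one million apps. Online experiment results show that Wide & Deep significantly increased app acquisitions compared with wide-only and deep-only models.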
Keywords: supervised learning

1. Introduction

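A recommender system can be viewed as a search ranking system, where the input query is a set of user and contextual information and the output is a ranked list of items. Given a query, the recommendation task is to find the relevant items in a database and then rank them based on certain objectives, such as clicks or purchases.
As in the general search ranking problem, one challenge in recommender systems is to achieve both memorization and generalization.
Memorization can be loosely defined as learning the frequent co-occurrence of items or features and exploiting the correlation available in the historical data.
Generalization, on the other hand, is based on transitivity of correlation and explores new feature combinations that have never or rarely occurred in the past. Recommendations based on memorization are usually more topical and directly relevant to the items on which users have already performed actions. Generalization tends to improve the diversity of the recommended items.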

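In large-scale industrial recommender systems, generalized linear models such as logistic regression (LR) are widely used because they are simple and interpretable. These models are usually trained on binarized sparse features with one-hot encoding. Memorization can be achieved effectively with cross-product transformations over sparse features, which explain how the co-occurrence of a feature pair correlates with the target label. Generalization can be added by using less granular (coarser) features, such as AND(user_installed_category=video, impression_category=music), but this often requires manual feature engineering. One limitation of cross-product transformations is that they do not generalize to query-item feature pairs that have not appeared in the training data.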

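Embedding-based models, such as factorization machines or deep neural networks, can generalize to previously unseen query-item feature pairs by learning a low-dimensional dense embedding vector for each query and item feature, with less feature engineering. However, it is difficult to learn effective low-dimensional representations when the underlying query-item matrix is sparse and high-rank, for example for users with specific preferences or niche items with narrow appeal. In such cases there should be no interaction between most query-item pairs, but dense embeddings produce nonzero predictions for all of them and can therefore over-generalize and make less relevant recommendations. Linear models with cross-product feature transformations, on the other hand, can memorize these "exception rules" with far fewer parameters.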

2. Recommender system overview

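This section gives an overview of the app recommender system. A query can include various user and contextual features, and the recommender system returns a ranked list of apps.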

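With over a million apps in the store, it is intractable to score every app for every query within the serving latency requirement (usually O(10) milliseconds). The first step on receiving a query is therefore retrieval: the retrieval system returns a short list of items that best match the query using various signals, usually a combination of machine-learned models and human-defined rules. After the candidate pool has been reduced, the ranking system ranks the items by their scores. The scores are usually P(y|x), the probability of a user action label y given the features x, which include user features (e.g., country, language, demographics), contextual features (e.g., device, hour of the day, day of the week), and impression features (e.g., app age, historical statistics of an app). This paper focuses on the ranking model, built with the Wide & Deep framework.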

3. Wide & Deep learning

The Wide Component

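The wide component is a generalized linear model of the form $y = w^T x + b$, where $y$ is the prediction, $x = [x_1, x_2, \ldots, x_d]$ is a vector of $d$ features, $w = [w_1, w_2, \ldots, w_d]$ are the model parameters, and $b$ is the bias. The feature set includes raw input features and transformed features. The most important transformation is the cross-product transformation, which is defined as:
$\phi_k(x) = \prod_{i=1}^{d} x_i^{c_{ki}}, \quad c_{ki} \in \{0, 1\}$
where $c_{ki}$ is 1 if the i-th feature is part of the k-th transformation $\phi_k$, and 0 otherwise. For binary features, a cross-product transformation (e.g., "AND(gender=female, language=en)") is 1 if and only if the constituent features ("gender=female" and "language=en") are all 1, and 0 otherwise.
This captures the interactions between the binary features and adds nonlinearity to the generalized linear model.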
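
To make the AND(...) behaviour concrete, here is a minimal Python sketch of a cross-product transformation over binary one-hot features. The function name `cross_product` and the extra feature "device=phone" are illustrative assumptions, not names from the paper.

```python
def cross_product(active_features, constituents):
    """Return 1 if every constituent binary feature is active, else 0.

    `active_features` is the set of one-hot feature strings whose value is 1;
    `constituents` lists the feature strings inside the AND(...) term.
    """
    return int(all(f in active_features for f in constituents))


# Hypothetical example: the features that are 1 for a single impression.
active = {"gender=female", "language=en", "device=phone"}

both = cross_product(active, ["gender=female", "language=en"])
mixed = cross_product(active, ["gender=female", "language=fr"])
print(both, mixed)
```

Each such cross becomes one more binary input to the linear model, with its own weight, which is how the wide part memorizes specific feature pairs.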

The Deep Component

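The deep component is a feed-forward neural network. For categorical features, the original inputs are feature strings (e.g., "language=en"). Each of these sparse, high-dimensional categorical features is first converted into a low-dimensional, dense real-valued vector, referred to as an embedding vector. The dimensionality of the embeddings is usually on the order of O(10) to O(100). The embedding vectors are initialized randomly and then trained to minimize the final loss function. These low-dimensional dense embedding vectors are then fed into the hidden layers.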
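
The forward pass of the deep part can be sketched in plain Python as an embedding lookup followed by ReLU layers. This is only an illustration under assumptions: the vocabulary, the tiny dimensions, and the names `relu_layer` and `deep_forward` are hypothetical, and training (the backward pass) is left out.

```python
import random

random.seed(0)
EMBED_DIM = 4  # illustrative only; the text gives O(10) to O(100) in practice
HIDDEN = 3     # illustrative hidden layer width

# Hypothetical vocabulary mapping feature strings to integer IDs.
vocab = {"language=en": 0, "language=fr": 1, "device=phone": 2}

# Embedding table: one randomly initialised dense vector per ID.
embeddings = [[random.gauss(0.0, 0.1) for _ in range(EMBED_DIM)] for _ in vocab]


def relu_layer(a, W, b):
    """Compute ReLU(W a + b) for one hidden layer."""
    return [max(0.0, sum(w * x for w, x in zip(row, a)) + bias)
            for row, bias in zip(W, b)]


def deep_forward(feature_strings, layers):
    """Look up and concatenate embeddings, then apply each hidden layer."""
    a = [v for f in feature_strings for v in embeddings[vocab[f]]]
    for W, b in layers:
        a = relu_layer(a, W, b)
    return a


in_dim = 2 * EMBED_DIM  # two categorical features are concatenated below
layers = [([[random.gauss(0.0, 0.1) for _ in range(in_dim)] for _ in range(HIDDEN)],
           [0.0] * HIDDEN)]
activation = deep_forward(["language=en", "device=phone"], layers)
```

In a real system the embedding table and layer weights would be updated by gradient descent; here they stay at their random initial values.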

Joint learning

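The wide component and the deep component are combined by taking a weighted sum of their output log odds as the prediction, which is then fed into one common logistic loss function for joint training.
Note the difference between joint training and an ensemble. In an ensemble, the individual models are trained separately and their predictions are combined only at inference time. In contrast, joint training optimizes all parameters simultaneously, taking the wide part, the deep part, and the weights of their sum into account at training time. There are implications for model size too: in an ensemble, because training is disjoint, each individual model usually needs to be larger (e.g., with more features and transformations) to reach reasonable accuracy. With joint training, by comparison, the wide part only needs to complement the weaknesses of the deep part with a small number of cross-product feature transformations, rather than being a full-size wide model.
Joint training backpropagates the gradients from the output to both the wide and the deep part simultaneously using mini-batch stochastic optimization. The wide part is optimized with Follow-the-regularized-leader (FTRL) with L1 regularization, and the deep part with AdaGrad.
For logistic regression, the model's prediction is:
$P(Y = 1 \mid x) = \sigma(w_{wide}^T [x, \phi(x)] + w_{deep}^T a^{(l_f)} + b)$
where $Y$ is the binary class label, $\sigma(\cdot)$ is the sigmoid function, $\phi(x)$ are the cross-product transformations of the original features $x$, $b$ is the bias term, $w_{wide}$ is the vector of wide model weights, and $w_{deep}$ are the weights applied on the final activations $a^{(l_f)}$.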
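
The combined prediction and the shared logistic loss can be written out as a small Python sketch. All numeric values are made-up toy inputs, and `predict` and `log_loss` are hypothetical helper names; the optimizers (FTRL, AdaGrad) are not implemented here.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(wide_x, w_wide, deep_a, w_deep, b):
    """P(Y=1|x) = sigmoid(w_wide . [x, phi(x)] + w_deep . a + b)."""
    wide_logit = sum(w * x for w, x in zip(w_wide, wide_x))
    deep_logit = sum(w * a for w, a in zip(w_deep, deep_a))
    return sigmoid(wide_logit + deep_logit + b)


def log_loss(y, p):
    """Logistic loss for one example with label y in {0, 1}."""
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


# Toy inputs (hypothetical): two raw binary features plus one cross-product
# term for the wide part, and a two-unit final activation for the deep part.
wide_x = [1.0, 0.0, 1.0]
w_wide = [0.5, -0.3, 0.2]
deep_a = [0.4, 0.0]
w_deep = [1.0, 2.0]

p = predict(wide_x, w_wide, deep_a, w_deep, b=-1.1)
loss = log_loss(1, p)
```

Because both parts feed a single logit, the derivative of the loss with respect to that logit (p - y) is shared, so one backward pass updates the wide weights and the deep weights together. That shared signal is what separates joint training from an ensemble of separately trained models.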

4. System implementation

Implementing the app recommendation pipeline involves three stages: data generation, model training, and model serving.

Data generation

User and app impression data within a period of time are used to generate training data. Each example corresponds to one impression. The label is app acquisition: 1 if the impressed app was installed, and 0 otherwise.
Vocabularies, which are tables mapping categorical feature strings to integer IDs, are also generated in this stage. The system computes the ID space for all the string features that occurred more than a minimum number of times. Continuous real-valued features are normalized to [0, 1] by mapping a feature value $x$ to its cumulative distribution function $P(X \le x)$, divided into $n_q$ quantiles. The normalized value is $\frac{i-1}{n_q-1}$ for values in the i-th quantile. Quantile boundaries are computed during data generation.
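
The two preprocessing steps, vocabulary building and quantile normalization, can be sketched as follows. This is an assumption-laden illustration: the function names are hypothetical, the boundary selection is one simple choice among several, and the normalized value (i - 1) / (n_q - 1) follows the formula given in the original paper. It assumes n_q >= 2 and at least n_q sample values.

```python
from bisect import bisect_left
from collections import Counter


def build_vocab(feature_strings, min_count):
    """Map strings seen at least `min_count` times to integer IDs."""
    counts = Counter(feature_strings)
    kept = sorted(s for s, c in counts.items() if c >= min_count)
    return {s: i for i, s in enumerate(kept)}


def quantile_boundaries(values, n_q):
    """Upper boundaries of the first n_q - 1 quantiles, computed once offline."""
    ordered = sorted(values)
    return [ordered[(len(ordered) * k) // n_q - 1] for k in range(1, n_q)]


def normalize(x, boundaries):
    """Return (i - 1) / (n_q - 1) for a value x that falls in the i-th quantile."""
    n_q = len(boundaries) + 1
    i = bisect_left(boundaries, x) + 1  # 1-based quantile index
    return (i - 1) / (n_q - 1)


vocab = build_vocab(["a", "a", "b", "c", "c", "c"], min_count=2)
bounds = quantile_boundaries([1, 2, 3, 4, 5, 6, 7, 8], n_q=4)
lowest = normalize(1, bounds)
highest = normalize(8, bounds)
```

Rare strings such as "b" above receive no ID, which keeps the ID space bounded; the boundaries are stored with the model so that serving applies the same mapping as training.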

Model training

During training, our input layer takes in training data and vocabularies and generates sparse and dense features together with a label. The wide component consists of the cross-product transformation of user installed apps and impression apps. For the deep part of the model, a 32-dimensional embedding vector is learned for each categorical feature. We concatenate all the embeddings together with the dense features, resulting in a dense vector of approximately 1200 dimensions. The concatenated vector is then fed into 3 ReLU layers, and finally the logistic output unit.
The Wide & Deep models are trained on over 500 billion examples. Every time a new set of training data arrives, the model needs to be retrained, and retraining from scratch each time is computationally expensive and delays the time from data arrival to serving an updated model. To address this, a warm-starting system was implemented that initializes a new model with the embeddings and the linear model weights from the previous model. Before loading the models into the model servers, a dry run of the model is done to make sure that it does not cause problems in serving live traffic.
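
Warm-starting can be pictured as copying selected parameters from the previous model into a freshly initialised one. The sketch below is hypothetical: the parameter names ("embeddings", "wide_weights", "hidden") and the dictionary layout are assumptions for illustration, not the production system's format.

```python
import copy


def warm_start(fresh_params, previous_params, reuse=("embeddings", "wide_weights")):
    """Return parameters for a new model, reusing selected ones from the old model.

    Only the embeddings and the linear (wide) weights are carried over, as the
    text describes; everything else keeps its fresh initialisation.
    """
    params = copy.deepcopy(fresh_params)
    for name in reuse:
        if name in previous_params:
            params[name] = copy.deepcopy(previous_params[name])
    return params


# Toy parameter sets (hypothetical values).
previous = {"embeddings": [[0.1, 0.2]], "wide_weights": [0.5], "hidden": [[0.9]]}
fresh = {"embeddings": [[0.0, 0.0]], "wide_weights": [0.0], "hidden": [[0.01]]}
model = warm_start(fresh, previous)
```

Deep copies keep the previous model untouched, so it can continue serving while the new model trains.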

Model Serving

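Once the model is trained and verified, it is loaded into the model servers. For each request, the servers receive a set of app candidates from the app retrieval system together with user features, and score each app. The apps are then ranked from the highest score to the lowest. The scores are calculated by running a forward inference pass over the Wide & Deep model. To serve each request on the order of 10 ms, performance is optimized with multithreading parallelism, running smaller batches in parallel.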
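
The serving pattern of splitting candidates into small batches, scoring them in parallel, and sorting by score might look like the following sketch. The scoring function is a hypothetical stand-in for the model's forward pass, and the batch and worker sizes are arbitrary. Python threads do not speed up pure-Python computation, so this only illustrates the structure, not the latency gain.

```python
from concurrent.futures import ThreadPoolExecutor


def score_batch(batch, score_fn):
    """Score one small batch of candidate apps."""
    return [(app, score_fn(app)) for app in batch]


def rank_candidates(candidates, score_fn, batch_size=4, workers=4):
    """Score candidates in small parallel batches and sort from high to low."""
    batches = [candidates[i:i + batch_size]
               for i in range(0, len(candidates), batch_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        chunks = pool.map(lambda b: score_batch(b, score_fn), batches)
        scored = [pair for chunk in chunks for pair in chunk]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical scores standing in for the model's forward inference pass.
toy_scores = {"app_a": 0.2, "app_b": 0.9, "app_c": 0.5}
ranking = rank_candidates(list(toy_scores), toy_scores.get, batch_size=2, workers=2)
```

Smaller batches let several threads work on one request at once, trading a little scheduling overhead for lower end-to-end latency.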

5. Experiments

App acquisitions

Offline: Wide & Deep has a slightly higher offline AUC.
Online: in the A/B test, the impact is more significant on online traffic. One possible reason is that the impressions and labels in offline data sets are fixed, whereas the online system can generate new exploratory recommendations by blending generalization with memorization, and learn from new user responses.

Serving performance

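Challenge: serving with high throughput and low latency.
At peak traffic, the recommender servers score over 10 million apps per second.
With single threading, scoring all candidates in a single batch takes 31 ms.
With multithreading and smaller batches, client-side latency drops to 14 ms (including serving overhead).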