Translation: Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation

Source: Internet | Published by: 天干地支算法年月日 | Editor: 程序博客网 | Time: 2024/05/22 01:39

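Learning Phrase Representations with the RNN Encoder-Decoder Model for Statistical Machine Translation

Original paper: "Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation"
Original authors: Kyunghyun Cho et al.

Note: this is a classic paper and one of the earlier ones to propose the seq2seq model. I translated it so that we can study it together.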

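Abstract

In this paper, we propose a novel neural network model called RNN Encoder-Decoder that consists of two recurrent neural networks (RNN). One RNN encodes a sequence of symbols into a fixed-length vector representation, and the other decodes the representation into another sequence of symbols. The encoder and decoder of the proposed model are jointly trained to maximize the conditional probability of a target sequence given a source sequence. The performance of a statistical machine translation system is empirically found to improve by using the conditional probabilities of phrase pairs computed by the RNN Encoder-Decoder as an additional feature in the existing log-linear model. Qualitatively, we show that the proposed model learns a semantically and syntactically meaningful representation of linguistic phrases.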

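1 Introduction

Deep neural networks have shown great success in various applications such as object recognition and speech recognition. Furthermore, many recent works have shown that neural networks can be used successfully in a number of natural language processing (NLP) tasks. These include, but are not limited to, language modeling, paraphrase detection and word embedding extraction. In the field of statistical machine translation (SMT), deep neural networks have begun to show promising results. Schwenk (2012) summarizes a successful use of feedforward neural networks in the framework of a phrase-based SMT system.
Along this line of research on using neural networks for SMT, this paper focuses on a novel neural network architecture that can be used as a part of the conventional phrase-based SMT system. The proposed architecture, which we refer to as an RNN Encoder–Decoder, consists of two recurrent neural networks (RNN) that act as an encoder and decoder pair. The encoder maps a variable-length source sequence to a fixed-length vector, and the decoder maps that vector back to a variable-length target sequence. The two networks are trained jointly to maximize the conditional probability of the target sequence given a source sequence. Additionally, we propose a rather sophisticated hidden unit in order to improve both the memory capacity and the ease of training.
We qualitatively analyze the trained RNN Encoder-Decoder by comparing its phrase scores with those given by the existing translation model. The qualitative analysis shows that the RNN Encoder-Decoder is better at capturing the linguistic regularities in the phrase table, which indirectly explains the quantitative improvement in overall translation performance. Further analysis of the model reveals that the RNN Encoder-Decoder learns a continuous space representation of a phrase that preserves both the semantic and the syntactic structure of the phrase.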

2 RNN Encoder–Decoder

2.1 Preliminary: Recurrent Neural Networks

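A recurrent neural network (RNN) is a neural network that consists of a hidden state $h$ and an optional output $y$, and that operates on a variable-length sequence $x = (x_1, \dots, x_T)$. At each time step $t$, the hidden state $h_{\langle t \rangle}$ of the RNN is updated by

$$h_{\langle t \rangle} = f\left(h_{\langle t-1 \rangle}, x_t\right) \tag{1}$$

where $f$ is a non-linear activation function. $f$ may be as simple as an element-wise logistic sigmoid function or as complex as a long short-term memory (LSTM) unit.
An RNN can learn a probability distribution over a sequence by being trained to predict the next symbol in the sequence. In that case, the output at each time step $t$ is the conditional distribution $p(x_t \mid x_{t-1}, \dots, x_1)$. For example, a multinomial distribution (1-of-K coding) can be output using a softmax activation function

$$p(x_{t,j} = 1 \mid x_{t-1}, \dots, x_1) = \frac{\exp\left(w_j h_{\langle t \rangle}\right)}{\sum_{j'=1}^{K} \exp\left(w_{j'} h_{\langle t \rangle}\right)} \tag{2}$$

for all possible symbols $j = 1, \dots, K$, where $w_j$ are the rows of a weight matrix $W$. By combining these probabilities, we can compute the probability of the sequence $x$ using

$$p(x) = \prod_{t=1}^{T} p(x_t \mid x_{t-1}, \dots, x_1) \tag{3}$$

From this learned distribution, it is straightforward to sample a new sequence by iteratively sampling a symbol at each time step.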
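
To make Eqs. (1) to (3) concrete, here is a minimal Python sketch. It is not code from the paper. It assumes $f = \tanh$, one-hot (1-of-K) inputs, and small hand-picked weights; the names `W_in`, `W_rec` and `W_out` are hypothetical. It also assumes that the state used to predict $x_t$ has consumed only $x_1, \dots, x_{t-1}$.

```python
import math

def softmax(scores):
    """Turn real-valued scores into a probability distribution (Eq. 2)."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def rnn_step(h_prev, x_onehot, W_in, W_rec):
    """Eq. (1) with f = tanh (an assumption): h_t = tanh(W_in x_t + W_rec h_{t-1})."""
    return [
        math.tanh(
            sum(w * x for w, x in zip(W_in[i], x_onehot))
            + sum(u * h for u, h in zip(W_rec[i], h_prev))
        )
        for i in range(len(h_prev))
    ]

def sequence_log_prob(symbols, W_in, W_rec, W_out, K):
    """log p(x) = sum over t of log p(x_t | x_{t-1}, ..., x_1), i.e. the log of Eq. (3)."""
    h = [0.0] * len(W_rec)
    log_p = 0.0
    for sym in symbols:
        # Predict the current symbol from the state that summarizes the past.
        scores = [sum(w * hj for w, hj in zip(W_out[j], h)) for j in range(K)]
        log_p += math.log(softmax(scores)[sym])
        # Then consume the symbol to update the state.
        onehot = [1.0 if k == sym else 0.0 for k in range(K)]
        h = rnn_step(h, onehot, W_in, W_rec)
    return log_p

# Toy setup: K = 3 symbols, 2 hidden units, arbitrary weights.
K = 3
W_in = [[0.5, -0.3, 0.1], [0.2, 0.4, -0.6]]
W_rec = [[0.1, 0.0], [-0.2, 0.3]]
W_out = [[0.7, -0.1], [0.0, 0.5], [-0.4, 0.2]]
print(sequence_log_prob([0, 2, 1], W_in, W_rec, W_out, K))
```

Because every step outputs a normalized distribution, the probabilities of all sequences of a fixed length sum to one, which is a quick sanity check for this kind of model.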

2.2 RNN Encoder–Decoder

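In this paper, we propose a novel neural network architecture that learns to encode a variable-length sequence into a fixed-length vector representation and to decode a given fixed-length vector representation back into a variable-length sequence. From a probabilistic perspective, this new model is a general method for learning the conditional distribution over a variable-length sequence conditioned on yet another variable-length sequence, e.g. $p(y_1, \dots, y_{T'} \mid x_1, \dots, x_T)$. Note that the input sequence length $T$ and the output sequence length $T'$ may differ.
The encoder is an RNN that reads each symbol of an input sequence $x$ sequentially. As it reads each symbol, the hidden state of the RNN changes according to Eq. (1). After reading the end of the sequence (marked by an end-of-sequence symbol), the hidden state of the RNN is a summary $c$ of the whole input sequence.
The decoder of the model is another RNN, which is trained to generate the output sequence by predicting the next symbol $y_t$ given the hidden state $h_{\langle t \rangle}$. However, unlike the RNN described in Section 2.1, both $y_t$ and $h_{\langle t \rangle}$ are also conditioned on $y_{t-1}$ and on the summary $c$ of the input sequence. Hence, the hidden state of the decoder at time $t$ is computed by

$$h_{\langle t \rangle} = f\left(h_{\langle t-1 \rangle}, y_{t-1}, c\right)$$

and, similarly, the conditional distribution of the next symbol is

$$P(y_t \mid y_{t-1}, y_{t-2}, \dots, y_1, c) = g\left(h_{\langle t \rangle}, y_{t-1}, c\right)$$

for given activation functions $f$ and $g$ (the latter must produce valid probabilities, e.g. with a softmax).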
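
The encoder and decoder recurrences can be sketched in a few lines of Python. This is an illustrative toy, not the authors' implementation. It assumes plain tanh units for $f$, a softmax over a linear map for $g$, a decoder state initialized from $c$ through a hypothetical matrix `V_init`, and random weights. All parameter names are made up for the example, and end-of-sequence handling is left out.

```python
import math
import random

def matvec(M, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def vec_sum(*vecs):
    """Element-wise sum of equally sized vectors."""
    return [sum(vals) for vals in zip(*vecs)]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def one_hot(k, size):
    return [1.0 if i == k else 0.0 for i in range(size)]

def init_params(k_src, k_tgt, hidden, seed=0):
    """Small random weights; every key below is a hypothetical name."""
    rng = random.Random(seed)
    def mat(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]
    return {
        "k_src": k_src, "k_tgt": k_tgt, "hidden": hidden,
        "W_enc": mat(hidden, k_src), "U_enc": mat(hidden, hidden),
        "V_init": mat(hidden, hidden),
        "W_dec": mat(hidden, k_tgt), "U_dec": mat(hidden, hidden),
        "C_dec": mat(hidden, hidden),
        "O_h": mat(k_tgt, hidden), "O_y": mat(k_tgt, k_tgt), "O_c": mat(k_tgt, hidden),
    }

def encode(source, p):
    """Read the source symbols one by one; the final state is the summary c."""
    h = [0.0] * p["hidden"]
    for sym in source:
        x = one_hot(sym, p["k_src"])
        h = [math.tanh(a) for a in vec_sum(matvec(p["W_enc"], x), matvec(p["U_enc"], h))]
    return h

def log_prob(source, target, p):
    """log p(target | source) under the toy model."""
    c = encode(source, p)
    h = [math.tanh(a) for a in matvec(p["V_init"], c)]  # assumed initialization from c
    y_prev = [0.0] * p["k_tgt"]  # no previous symbol at the first step
    total = 0.0
    for sym in target:
        # h_t = f(h_{t-1}, y_{t-1}, c)
        h = [math.tanh(a) for a in vec_sum(
            matvec(p["W_dec"], y_prev), matvec(p["U_dec"], h), matvec(p["C_dec"], c))]
        # P(y_t | y_{t-1}, ..., y_1, c) = g(h_t, y_{t-1}, c)
        logits = vec_sum(matvec(p["O_h"], h), matvec(p["O_y"], y_prev), matvec(p["O_c"], c))
        total += math.log(softmax(logits)[sym])
        y_prev = one_hot(sym, p["k_tgt"])
    return total

params = init_params(k_src=4, k_tgt=3, hidden=5)
# A source of length T = 4 and a target of length T' = 2.
print(log_prob([0, 3, 1, 2], [2, 0], params))
```

The only link between the two networks is the fixed-length vector `c`, which is why the source and target lengths are free to differ.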

Figure 1: An illustration of the RNN Encoder-Decoder

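The two components of the RNN Encoder-Decoder are jointly trained to maximize the conditional log-likelihood

$$\max_{\theta} \frac{1}{N} \sum_{n=1}^{N} \log p_{\theta}(y_n \mid x_n) \tag{4}$$

where $\theta$ is the set of model parameters and each $(x_n, y_n)$ is an (input sequence, output sequence) pair from the training set. In our case, the output of the decoder is differentiable all the way back to the input, so we can use a gradient-based algorithm to estimate the model parameters.
Once the RNN Encoder-Decoder is trained, the model can be used in two ways. One is to generate a target sequence given an input sequence. The other is to score a given pair of input and output sequences, where the score is simply the probability $p_{\theta}(y \mid x)$ from Eqs. (3) and (4).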
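
To spell out how the score of a single pair follows from Eqs. (3) and (4), the chain rule of Eq. (3) can be applied to the decoder's conditional distributions. The expansion below is a worked restatement under the notation of Section 2.2, with $c$ denoting the encoder's summary of $x$; it adds no new assumption beyond that notation.

```latex
\log p_{\theta}(y \mid x)
  = \log \prod_{t=1}^{T'} P(y_t \mid y_{t-1}, \dots, y_1, c)
  = \sum_{t=1}^{T'} \log P(y_t \mid y_{t-1}, \dots, y_1, c)
```

Averaging this quantity over the $N$ training pairs gives the objective in Eq. (4).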

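2.3 Hidden Unit that Adaptively Remembers and Forgets

In addition to the novel model architecture, we also propose a new type of hidden unit ($f$ in Eq. (1)) that is motivated by the LSTM unit but is much simpler to compute and implement. Figure 2 shows a graphical depiction of the proposed hidden unit.
Figure 2: An illustration of the proposed hidden activation function. The update gate $z$ selects whether the hidden state $h$ is to be updated with a new hidden state $\tilde{h}$. The reset gate $r$ decides whether the previous hidden state is ignored. See Eqs. (5)-(8) for the detailed equations of $r$, $z$, $h$ and $\tilde{h}$.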
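Let us describe how the activation of the $j$-th hidden unit is computed. First, the reset gate $r_j$ is computed by

$$r_j = \sigma\left([W_r x]_j + [U_r h_{\langle t-1 \rangle}]_j\right) \tag{5}$$

where $\sigma$ is the logistic sigmoid function and $[\cdot]_j$ denotes the $j$-th element of a vector. $x$ and $h_{\langle t-1 \rangle}$ are the input and the previous hidden state, respectively. $W_r$ and $U_r$ are weight matrices that are learned.
Similarly, the update gate $z_j$ is computed by

$$z_j = \sigma\left([W_z x]_j + [U_z h_{\langle t-1 \rangle}]_j\right) \tag{6}$$

The actual activation of the proposed unit $h_j$ is then computed by

$$h_j^{\langle t \rangle} = z_j h_j^{\langle t-1 \rangle} + (1 - z_j)\, \tilde{h}_j^{\langle t \rangle} \tag{7}$$

where

$$\tilde{h}_j^{\langle t \rangle} = \phi\left([W x]_j + [U (r \odot h_{\langle t-1 \rangle})]_j\right) \tag{8}$$

In this formulation, when the reset gate is close to 0, the hidden state is forced to ignore the previous hidden state and reset with the current input only. This effectively allows the hidden state to drop any information that is later found to be irrelevant, which makes the representation more compact.
On the other hand, the update gate controls how much information from the previous hidden state carries over to the current hidden state. This acts similarly to the memory cell in the LSTM network and helps the RNN remember long-term information. Furthermore, this may be considered a variant of a leaky-integration unit.
As each hidden unit has separate reset and update gates, each hidden unit will learn to capture dependencies over different time scales. Units that learn to capture short-term dependencies will tend to have reset gates that are frequently active, whereas units that capture longer-term dependencies will have update gates that are mostly active.
In our preliminary experiments, we found that it is crucial to use this new unit with gating units. We were not able to get meaningful results with a plain tanh unit without any gating.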
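
Eqs. (5) to (8) translate almost line by line into code. The following Python sketch is a toy illustration under stated assumptions: $\phi$ is taken to be tanh, there are no bias terms (as in the equations above), and the weight values and dictionary keys are hypothetical.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def matvec(M, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def gated_step(x, h_prev, p):
    """One update of the gated hidden unit for all units j at once."""
    # Eq. (5): reset gate
    r = [sigmoid(a + b) for a, b in zip(matvec(p["W_r"], x), matvec(p["U_r"], h_prev))]
    # Eq. (6): update gate
    z = [sigmoid(a + b) for a, b in zip(matvec(p["W_z"], x), matvec(p["U_z"], h_prev))]
    # Eq. (8): candidate state, computed from the reset-gated previous state
    reset_h = [rj * hj for rj, hj in zip(r, h_prev)]
    h_tilde = [math.tanh(a + b)
               for a, b in zip(matvec(p["W"], x), matvec(p["U"], reset_h))]
    # Eq. (7): interpolate between the previous state and the candidate
    return [zj * hp + (1.0 - zj) * ht for zj, hp, ht in zip(z, h_prev, h_tilde)]

# Toy setup: 1-dimensional input, 2 hidden units.
zeros = [[0.0, 0.0], [0.0, 0.0]]
base = {"W_r": [[0.0], [0.0]], "U_r": zeros, "U_z": zeros,
        "W": [[1.0], [-1.0]], "U": [[0.5, 0.0], [0.0, 0.5]]}
keep = dict(base, W_z=[[50.0], [50.0]])       # z close to 1: copy the previous state
replace = dict(base, W_z=[[-50.0], [-50.0]])  # z close to 0: take the candidate state

h_prev = [0.3, -0.7]
print(gated_step([1.0], h_prev, keep))
print(gated_step([1.0], h_prev, replace))
```

With the update gate saturated near 1, the unit simply carries its previous state forward, which is the mechanism that lets it retain long-term information; with the gate near 0, the state is overwritten by the candidate.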

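3 Statistical Machine Translation

In a commonly used statistical machine translation (SMT) system, the goal of the system (specifically, of the decoder) is to find a translation $f$ given a source sentence $e$ that maximizes

$$p(f \mid e) \propto p(e \mid f)\, p(f)$$

In practice, however, most SMT systems model $\log p(f \mid e)$ as a log-linear model with additional features and corresponding weights:

$$\log p(f \mid e) = \sum_{n=1}^{N} w_n f_n(f, e) + \log Z(e) \tag{9}$$

where $f_n$ and $w_n$ are the $n$-th feature and weight, respectively. $Z(e)$ is a normalization constant that does not depend on the weights. The weights are often optimized to maximize the BLEU score on a development set.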
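
Since $Z(e)$ is the same for every candidate translation of one source sentence, ranking candidates only needs the weighted feature sum in Eq. (9). The Python sketch below illustrates that point; the feature names (`tm`, `lm`, `rnn_encdec`), the weights and the feature values are all invented for the example and do not come from the paper.

```python
def loglinear_score(features, weights):
    """Unnormalized log-linear score: sum over n of w_n * f_n(f, e)."""
    return sum(weights[name] * value for name, value in features.items())

def best_translation(candidates, weights):
    """Pick the candidate with the highest score; log Z(e) cancels out."""
    return max(candidates, key=lambda cand: loglinear_score(cand["features"], weights))

# Hypothetical weights, e.g. as they might come out of tuning on a development set.
weights = {"tm": 1.0, "lm": 0.5, "rnn_encdec": 0.8}

# Two made-up candidate translations with made-up log-domain feature values.
candidates = [
    {"text": "candidate A", "features": {"tm": -2.0, "lm": -4.0, "rnn_encdec": -1.0}},
    {"text": "candidate B", "features": {"tm": -1.5, "lm": -6.0, "rnn_encdec": -3.0}},
]

for cand in candidates:
    print(cand["text"], loglinear_score(cand["features"], weights))
print("best:", best_translation(candidates, weights)["text"])
```

Adding a new feature to such a model only means adding one more key to each feature dictionary and one more weight to tune.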
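In the phrase-based SMT framework introduced in (Koehn et al., 2003) and (Marcu and Wong, 2002), the translation model $\log p(e \mid f)$ is factorized into the translation probabilities of matching phrases in the source and target sentences. These probabilities are once again treated as additional features in the log-linear model (see Eq. (9)) and are weighted accordingly to maximize the BLEU score.
Since the neural network language model was proposed in (Bengio et al., 2003), neural networks have been used widely in SMT systems. In many cases, neural networks have been used to rescore translation hypotheses (n-best lists) (see, e.g., (Schwenk et al., 2006)). Recently, however, there has been growing interest in training neural networks to score the translated sentence (or phrase pairs) using a representation of the source sentence as an additional input. See, e.g., (Schwenk, 2012), (Son et al., 2012) and (Zou et al., 2013).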

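3.1 Scoring Phrase Pairs with RNN Encoder–Decoder

Here we propose to train the RNN Encoder–Decoder (see Section 2.2) on a table of phrase pairs and to use its scores as additional features in the log-linear model of Eq. (9) when tuning the SMT decoder.
When we train the RNN Encoder–Decoder, we ignore the (normalized) frequencies of each phrase pair in the original corpora. This measure was taken (1) to reduce the computational expense of randomly selecting phrase pairs from a large phrase table according to the normalized frequencies, and (2) to ensure that the RNN Encoder–Decoder does not simply learn to rank the phrase pairs according to their numbers of occurrences. One underlying reason for this choice is that the existing translation probability in the phrase table already reflects the frequencies of the phrase pairs in the original corpus. Given the fixed capacity of the RNN Encoder–Decoder, we try to ensure that most of that capacity is focused on learning linguistic regularities, i.e., distinguishing between plausible and implausible translations, or learning the "manifold" (region of probability concentration) of plausible translations.
Once the RNN Encoder–Decoder is trained, we add a new score for each phrase pair to the existing phrase table. This allows the new scores to enter the existing tuning algorithm with minimal additional computational overhead.
As Schwenk pointed out in (Schwenk, 2012), it is possible to completely replace the existing phrase table with the proposed RNN Encoder–Decoder. In that case, for a given source phrase, the RNN Encoder–Decoder would need to generate a list of (good) target phrases. This, however, requires an expensive sampling procedure to be performed repeatedly. In this paper, therefore, we only consider rescoring the phrase pairs in the phrase table.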
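
The rescoring step described in this section amounts to attaching one more feature to every phrase-table entry. The Python sketch below shows the shape of that operation only. The table layout, the feature name `rnn_encdec`, the toy entries with their `tm` values, and the `dummy_score` function (a stand-in for $\log p(\text{target} \mid \text{source})$ from a trained model) are all assumptions made for illustration.

```python
def rescore_phrase_table(phrase_table, score_fn, feature_name="rnn_encdec"):
    """Return a new table in which every entry carries one extra feature."""
    rescored = []
    for entry in phrase_table:
        features = dict(entry["features"])  # copy so the input table is not mutated
        features[feature_name] = score_fn(entry["source"], entry["target"])
        rescored.append({**entry, "features": features})
    return rescored

def dummy_score(source, target):
    """Hypothetical stand-in for a trained model's log p(target | source)."""
    return -abs(len(source.split()) - len(target.split())) - 1.0

# Toy phrase table: each unique pair is listed once, regardless of how
# often it occurred in the corpus.
table = [
    {"source": "at the end of the", "target": "à la fin de la", "features": {"tm": -0.9}},
    {"source": "for the first time", "target": "pour la première fois", "features": {"tm": -0.4}},
]

for entry in rescore_phrase_table(table, dummy_score):
    print(entry["source"], "|||", entry["target"], "|||", entry["features"])
```

Because the existing features are left untouched, the tuning algorithm sees the new score as just another weighted feature.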

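3.2 Related Approaches: Neural Networks in Machine Translation

Before presenting the empirical results, we discuss a number of recent works that have proposed using neural networks in the context of SMT.
(Note: the rest of this section serves as further reading and is quoted from the original paper.)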
Schwenk in (Schwenk, 2012) proposed a similar approach of scoring phrase pairs. Instead of the RNN-based neural network, he used a feedforward neural network that has fixed-size inputs (7 words in his case, with zero-padding for shorter phrases) and fixed-size outputs (7 words in the target language). When it is used specifically for scoring phrases for the SMT system, the maximum phrase length is often chosen to be small. However, as the length of phrases increases or as we apply neural networks to other variable-length sequence data, it is important that the neural network can handle variable-length input and output. The proposed RNN Encoder–Decoder is well-suited for these applications.
Similar to (Schwenk, 2012), Devlin et al. (Devlin et al., 2014) proposed to use a feedforward neural network to model a translation model, however, by predicting one word in a target phrase at a time. They reported an impressive improvement, but their approach still requires the maximum length of the input phrase (or context words) to be fixed a priori.
Although it is not exactly a neural network they train, the authors of (Zou et al., 2013) proposed to learn a bilingual embedding of words/phrases. They use the learned embedding to compute the distance between a pair of phrases which is used as an additional score of the phrase pair in an SMT system.
In (Chandar et al., 2014), a feedforward neural network was trained to learn a mapping from a bag-of-words representation of an input phrase to an output phrase. This is closely related to both the proposed RNN Encoder–Decoder and the model proposed in (Schwenk, 2012), except that their input representation of a phrase is a bag-of-words. A similar approach of using bag-of-words representations was proposed in (Gao et al., 2013) as well. Earlier, a similar encoder–decoder model using two recursive neural networks was proposed in (Socher et al., 2011), but their model was restricted to a monolingual setting, i.e. the model reconstructs an input sentence. More recently, another encoder–decoder model using an RNN was proposed in (Auli et al., 2013), where the decoder is conditioned on a representation of either a source sentence or a source context.
One important difference between the proposed RNN Encoder–Decoder and the approaches in (Zou et al., 2013) and (Chandar et al., 2014) is that the order of the words in source and target phrases is taken into account. The RNN Encoder–Decoder naturally distinguishes between sequences that have the same words but in a different order, whereas the aforementioned approaches effectively ignore order information.
The closest approach related to the proposed RNN Encoder–Decoder is the Recurrent Continuous Translation Model (Model 2) proposed in (Kalchbrenner and Blunsom, 2013). In their paper, they proposed a similar model that consists of an encoder and decoder. The difference with our model is that they used a convolutional n-gram model (CGM) for the encoder and the hybrid of an inverse CGM and a recurrent neural network for the decoder. They, however, evaluated their model on rescoring the n-best list proposed by the conventional SMT system and computing the perplexity of the gold standard translations.

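4 Experiments

(Note: this section reports the experimental results and is only excerpted here; please consult the original paper.)
We evaluate our approach on the English/French translation task of the WMT'14 workshop.

4.1 Data and Baseline System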

Large amounts of resources are available to build an English/French SMT system in the framework of the WMT’14 translation task. The bilingual corpora include Europarl (61M words), news commentary (5.5M), UN (421M), and two crawled corpora of 90M and 780M words respectively. The last two corpora are quite noisy. To train the French language model, about 712M words of crawled newspaper material is available in addition to the target side of the bitexts. All the word counts refer to French words after tokenization.
It is commonly acknowledged that training statistical models on the concatenation of all this data does not necessarily lead to optimal performance, and results in extremely large models which are difficult to handle. Instead, one should focus on the most relevant subset of the data for a given task. We have done so by applying the data selection method proposed in (Moore and Lewis, 2010), and its extension to bitexts (Axelrod et al., 2011). By these means we selected a subset of 418M words out of more than 2G words for language modeling and a subset of 348M out of 850M words for training the RNN Encoder–Decoder. We used the test set newstest2012 and 2013 for data selection and weight tuning with MERT, and newstest2014 as our test set. Each set has more than 70 thousand words and a single reference translation.
For training the neural networks, including the proposed RNN Encoder–Decoder, we limited the source and target vocabulary to the most frequent 15,000 words for both English and French. This covers approximately 93% of the dataset. All the out-of-vocabulary words were mapped to a special token ([UNK]).
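
The vocabulary restriction described above can be sketched as follows. This is a minimal Python illustration with a tiny made-up corpus and a vocabulary size of 3 instead of 15,000; the function names are hypothetical and this is not the preprocessing code used in the paper.

```python
from collections import Counter

UNK = "[UNK]"

def build_vocab(sentences, max_size):
    """Keep the max_size most frequent tokens."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return {token for token, _ in counts.most_common(max_size)}

def map_oov(sentence, vocab):
    """Replace every out-of-vocabulary token with the special [UNK] token."""
    return [token if token in vocab else UNK for token in sentence]

def coverage(sentences, vocab):
    """Fraction of running tokens that are inside the vocabulary."""
    tokens = [token for sentence in sentences for token in sentence]
    return sum(token in vocab for token in tokens) / len(tokens)

corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran"],
]
vocab = build_vocab(corpus, max_size=3)
print(map_oov(["the", "dog", "ran"], vocab))
print(coverage(corpus, vocab))
```

The `coverage` helper corresponds to the kind of figure quoted above (approximately 93% of the dataset for the 15,000-word vocabulary).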
The baseline phrase-based SMT system was built using Moses with default settings. This system achieves a BLEU score of 30.64 and 33.3 on the development and test sets, respectively (see Table 1).
(Remainder omitted)

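5 Conclusion

In this paper, we proposed a new neural network architecture, called an RNN Encoder-Decoder, that is able to learn the mapping from a sequence of arbitrary length to another sequence, possibly from a different set, of arbitrary length. The proposed RNN Encoder-Decoder is able either to score a pair of sequences (in terms of a conditional probability) or to generate a target sequence given a source sequence. Along with the new architecture, we proposed a novel hidden unit that includes a reset gate and an update gate, which adaptively control how much each hidden unit remembers or forgets while reading or generating a sequence.
We evaluated the proposed model on the task of statistical machine translation, where we used the RNN Encoder-Decoder to score each phrase pair in the phrase table. Qualitatively, we were able to show that the new model captures linguistic regularities in the phrase pairs well and that the RNN Encoder-Decoder is able to propose well-formed target phrases.
The scores from the RNN Encoder-Decoder were found to improve the overall translation performance in terms of BLEU scores. We also found that the contribution of the RNN Encoder-Decoder is rather orthogonal to the existing approaches of using neural networks in SMT systems, so that performance can be improved further by using, for instance, the RNN Encoder-Decoder and a neural network language model together.
Our qualitative analysis of the trained model shows that it indeed captures linguistic regularities at multiple levels, i.e., at the word level as well as the phrase level. This suggests that more natural language applications may benefit from the proposed RNN Encoder-Decoder.
The proposed architecture has large potential for further improvement and analysis. One approach that was not investigated here is to replace the whole, or a part, of the phrase table by letting the RNN Encoder-Decoder propose target phrases. Also, since the proposed model is not limited to written language, applying the architecture to other applications such as speech transcription will be important future research.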
