Memory Networks: End-To-End Memory Networks

Source: Internet | Editor: 程序博客网 | Time: 2024/05/21 15:00


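This is a more complete model that Facebook AI proposed after Memory Networks. As discussed in the previous post, the I and G modules in that model do nothing complicated: they simply vectorize the raw text and store it as memory without modifying the input. The O and R modules carry most of the work, but the final objective function shows that both require supervision: we need to know whether the memories selected by O are the right ones and whether the answer generated by R is correct. This limits how widely the model can be applied, because supervision is needed in too many places and the model is hard to train with backpropagation. The paper therefore proposes an end-to-end model that can be viewed as a continuous form of the Memory Network and that can be trained with much less supervision. It describes a single-layer and a multi-layer architecture; the multi-layer one is simply a stack of single layers. Let us first look at the single-layer architecture: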

Single-layer Memory Networks

The structure of the single-layer network is shown in the figure below and consists of the following modules:

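The main parameters of the model are four matrices, A, B, C and W. A, B and C are embedding matrices that encode the input text and the question into word vectors, and W is the final output matrix. As the figure shows, each input sentence s is encoded with both A and C to produce the Input and Output memory representations. The Input memory is multiplied with the encoded question vector to obtain the relevance of each sentence to the question q, and the Output memory is then summed with those relevance scores as weights to give the output vector. The question vector is added to this output and the result is passed to the final output layer. The sections below describe how each module works and how it is implemented (the order differs from the paper and follows my own understanding).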

Input module

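First is the input module (corresponding to the I and G components in the Memory Networks paper). Its job is to convert the input text into vectors and store them in memory. In this paper each sentence is compressed into a single vector that occupies one memory slot (the blue or yellow vertical bars in the figure above). In other words, a sentence vector is computed from the word vectors of the words in the sentence. The paper proposes two encodings, bag-of-words (BoW) and position encoding (PE). BoW simply sums the word vectors of all words in the sentence, $m_i = \sum_j A x_{ij}$; its drawback is that it loses the word order within the sentence and with it part of the meaning. Position encoding instead gives words at different positions different weights and takes the weighted sum of the word vectors as the sentence representation. The position encoding formula in the paper is $m_i = \sum_j l_j \cdot A x_{ij}$, where $\cdot$ is element-wise multiplication and $l_j$ is the position vector with entries $l_{kj} = (1 - j/J) - (k/d)(1 - 2j/J)$; $J$ is the number of words in the sentence and $d$ is the embedding dimension (the code later in this post helps to make this concrete).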
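To make the two sentence encodings concrete, here is a minimal NumPy sketch that follows the position-encoding formula from the paper rather than either repository referenced later. The function names `position_encoding_weights` and `encode_sentence`, the toy vocabulary size, and the random embedding matrix are assumptions made for illustration only.

```python
import numpy as np


def position_encoding_weights(sentence_len, embed_dim):
    """Return L with shape (sentence_len, embed_dim), where L[j-1, k-1] = l_kj."""
    j = np.arange(1, sentence_len + 1, dtype=np.float64)[:, None]  # word position, 1-based
    k = np.arange(1, embed_dim + 1, dtype=np.float64)[None, :]     # embedding index, 1-based
    return (1.0 - j / sentence_len) - (k / embed_dim) * (1.0 - 2.0 * j / sentence_len)


def encode_sentence(word_ids, embedding, use_pe=True):
    """Sum the word vectors of one sentence, optionally weighted by position."""
    vecs = embedding[word_ids]  # (J, d)
    if use_pe:
        vecs = vecs * position_encoding_weights(len(word_ids), embedding.shape[1])
    return vecs.sum(axis=0)     # (d,)


rng = np.random.default_rng(0)
A = rng.normal(size=(20, 4))   # hypothetical embedding matrix: 20 words, d = 4
s1 = np.array([3, 7, 5])
s2 = np.array([5, 7, 3])       # the same words in reverse order

# BoW cannot tell the two sentences apart, position encoding can.
bow_same = np.allclose(encode_sentence(s1, A, False), encode_sentence(s2, A, False))
pe_same = np.allclose(encode_sentence(s1, A, True), encode_sentence(s2, A, True))
print(bow_same, pe_same)
```

The comparison at the end illustrates the drawback of BoW described above: reordering the words leaves the BoW vector unchanged, while the position-encoded vector changes.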

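In addition, temporal information has to be encoded, for example in "Sam is in the bedroom after he is in the kitchen". On top of the $m_i$ obtained above, we add a term that corresponds to the order in which each sentence appears; note that sentences are indexed in reverse order here. This temporal information is stored in two matrices, $T_A$ and $T_C$, so the final memory representations for each sentence are $m_i = \sum_j A x_{ij} + T_A(i)$ and $c_i = \sum_j C x_{ij} + T_C(i)$.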
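The temporal term can be sketched as follows. This is an illustrative NumPy version under stated assumptions: it uses the BoW sum, the helper name `build_memory` and the toy sizes are made up, and "reverse order" is implemented as index 0 for the most recent sentence.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim, memory_size = 20, 4, 5

A = rng.normal(size=(vocab_size, embed_dim))      # input embedding
T_A = rng.normal(size=(memory_size, embed_dim))   # temporal embedding, one row per slot


def build_memory(story, embedding, temporal):
    """story: list of word-id lists, oldest sentence first. Returns (n_sentences, d)."""
    n = len(story)
    memory = np.zeros((n, embedding.shape[1]))
    for i, sentence in enumerate(story):
        age = n - 1 - i  # reverse index: 0 for the most recent sentence
        memory[i] = embedding[sentence].sum(axis=0) + temporal[age]
    return memory


# The first and last sentences contain exactly the same words.
story = [[1, 2, 3], [4, 5, 6], [1, 2, 3]]
m = build_memory(story, A, T_A)
print(np.allclose(m[0], m[2]))
```

Without the temporal term the first and last memory slots would be identical; with it, the model can tell which of two otherwise equal statements is more recent.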

Output module

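The input module above encodes the input text as vectors and stores them in memory. The memory is split into an Input part and an Output part: one interacts with the question to determine how relevant each memory slot is to the question, and the other uses that information to produce the output.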

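Consider the first part. The question is encoded by the input module into a vector u with the same dimension as $m_i$. Taking the dot product of u with each $m_i$ gives the similarity between the two vectors, which is then normalized with a softmax: $p_i = \mathrm{softmax}(u^T m_i)$.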

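$p_i$ measures the relevance of the question q to the memory $m_i$. The output vector o of the model is the sum of the output memories $c_i$ weighted by $p_i$: $o = \sum_i p_i c_i$.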
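The two steps above, scoring the memories against the question and taking the weighted sum, fit in a few lines of NumPy. This is a sketch with hand-picked toy vectors; the function name `attend` is an assumption, not a name from the paper or the repositories.

```python
import numpy as np


def softmax(x):
    e = np.exp(x - x.max())  # subtract the max for numerical stability
    return e / e.sum()


def attend(u, m, c):
    """u: (d,) question vector, m: (n, d) input memory, c: (n, d) output memory."""
    p = softmax(m @ u)  # relevance of each memory slot to the question
    o = p @ c           # weighted sum of the output memories
    return p, o


m = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
c = np.array([[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]])
u = np.array([2.0, 0.0])

p, o = attend(u, m, c)
print(p, o)
```

Slots 0 and 2 have the same dot product with u, so they receive equal weight, and slot 1, which is orthogonal to u, receives the least.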

Response module

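The output module produces a question-dependent weighted sum over the memory slots, that is, the knowledge in memory that is relevant to the question. The Response module uses this to produce the final answer. It adds the two vectors o and u, multiplies the sum by W, and applies a softmax to obtain the probability of each word being the answer: $\hat{a} = \mathrm{softmax}(W(o + u))$. The word with the highest probability is the answer. The model is trained with the cross-entropy loss as its objective.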
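As a rough illustration of the response step and its loss, the sketch below computes the answer distribution and the cross-entropy for a single example. The random W, o and u and the helper names `predict` and `cross_entropy` are assumptions for demonstration.

```python
import numpy as np


def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()


def predict(o, u, W):
    """Answer distribution over the vocabulary: softmax(W (o + u))."""
    return softmax(W @ (o + u))


def cross_entropy(probs, answer_id):
    """Negative log-likelihood of the correct answer word."""
    return -np.log(probs[answer_id])


rng = np.random.default_rng(0)
vocab_size, embed_dim = 6, 4
W = rng.normal(size=(vocab_size, embed_dim))  # hypothetical output matrix
o = rng.normal(size=embed_dim)                # output vector from the memory
u = rng.normal(size=embed_dim)                # question vector

a_hat = predict(o, u, W)
predicted = int(np.argmax(a_hat))             # the word with the highest probability
loss = cross_entropy(a_hat, predicted)
print(predicted, loss)
```

Because every step is differentiable, the loss can be backpropagated through W and through the embeddings that produced o and u, which is what makes the model end-to-end trainable.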

Multi-layer model

The multi-layer model simply stacks several single-layer models; each layer is called a hop. Its structure is shown in the figure below:

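First, the input to each layer above the first is the sum of the previous layer's o and u, $u^{k+1} = u^k + o^k$. For the parameters of each layer, the paper proposes two tying schemes, mainly to reduce the number of parameters (if every layer had its own parameters, there would be too many to train easily).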

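Adjacent: the output embedding of one layer is the input embedding of the layer above, $A^{k+1} = C^k$. In addition, W is tied to the top-layer C ($W^T = C^K$) and B to the bottom-layer A ($B = A^1$). This cuts the number of parameters roughly in half.
Layer-wise (RNN-like): as in an RNN, the parameters are fully shared across layers, $A^1 = A^2 = \dots = A^K$ and $C^1 = C^2 = \dots = C^K$. Because this greatly reduces the number of parameters and can hurt model quality, a modification is proposed: $u^{k+1} = H u^k + o^k$, which adds a linear mapping H between layers.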
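The adjacent scheme can be sketched as a forward pass over a list of K+1 embedding tables. This is a simplified NumPy illustration, not the referenced implementation: it uses BoW sentence vectors without position or temporal encoding, and the function name `memn2n_adjacent` and all sizes are assumptions.

```python
import numpy as np


def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()


def memn2n_adjacent(story, query, embeddings):
    """Multi-hop forward pass with adjacent weight tying.

    story: (n, J) word ids, query: (J,) word ids,
    embeddings: list [E_0, ..., E_K] of (V, d) tables, where
    B = A^1 = E_0, A^{k+1} = C^k = E_k, and the output layer reuses E_K.
    """
    u = embeddings[0][query].sum(axis=0)              # B = A^1
    for k in range(len(embeddings) - 1):
        m = embeddings[k][story].sum(axis=1)          # input memory of this hop
        c = embeddings[k + 1][story].sum(axis=1)      # output memory of this hop
        p = softmax(m @ u)
        u = u + p @ c                                 # u^{k+1} = u^k + o^k
    return softmax(embeddings[-1] @ u)                # W^T = C^K


rng = np.random.default_rng(0)
V, d, K = 12, 4, 3
embeddings = [rng.normal(scale=0.1, size=(V, d)) for _ in range(K + 1)]
story = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
query = np.array([1, 5, 9])

a_hat = memn2n_adjacent(story, query, embeddings)

n_untied = 2 * K + 2   # one A and one C per hop, plus B and W
n_tied = K + 1         # embedding tables actually stored with adjacent tying
print(a_hat.shape, n_untied, n_tied)
```

With K = 3 hops, the untied model would need 8 matrices of the same size while the tied one stores 4, which matches the "roughly half" figure above.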

TensorFlow implementation

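The best way to understand how the model works is to implement it. Since the paper was published in 2015, there are already many implementations on GitHub. We use two of them to cover the bAbI QA task and PTB language modeling.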

bAbI QA modeling

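This part follows the code at https://github.com/domluna/memn2n. A quick word on the dataset first: bAbI was released by Facebook and contains 20 types of questions, each representing a different form of reasoning, as shown below: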

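In this experiment the sentences and the question are fed to the model, which we hope will learn these reasoning patterns. Below we focus on the data processing and model construction code. Taking task 1 as an example, let us first look at the data format: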

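Basically, two statements are followed by a question together with its answer. The number after the answer indicates which line the question depends on (this is needed in Memory Networks, but this model weakens the supervision and does not use it). Fifteen lines make up one short story, meaning the 15 lines are related, and the next 15 lines form another story. memory_size is therefore 10 (of the 15 lines, 10 are statements and 5 are questions). The maximum sentence length is 7, so each processed story is a 10*7 matrix (memory_size by sentence length). Every 15 lines are turned into 5 training samples: the first is the first two statements plus the question and answer, the second is the first four statements plus the question and answer, and so on. In other words, each question is answered using all the statements before it. The data processing code is shown below; the key function is parse_stories, which performs the conversion. See the code comments for details.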

    import os
    import re

    import numpy as np


    def load_task(data_dir, task_id, only_supporting=False):
        '''Load the nth task. There are 20 tasks in total.

        Returns a tuple containing the training and testing data for the task.
        '''
        # Read the files and return the processed data
        assert task_id > 0 and task_id < 21

        files = os.listdir(data_dir)
        files = [os.path.join(data_dir, f) for f in files]
        s = 'qa{}_'.format(task_id)
        train_file = [f for f in files if s in f and 'train' in f][0]
        test_file = [f for f in files if s in f and 'test' in f][0]
        train_data = get_stories(train_file, only_supporting)
        test_data = get_stories(test_file, only_supporting)
        return train_data, test_data

    def tokenize(sent):
        '''Return the tokens of a sentence including punctuation.
        >>> tokenize('Bob dropped the apple. Where is the apple?')
        ['Bob', 'dropped', 'the', 'apple', '.', 'Where', 'is', 'the', 'apple', '?']
        '''
        # Split on runs of non-word characters and keep them as tokens.
        # The pattern must not be optional: an optional group can match the
        # empty string, which changes how re.split behaves on Python 3.7+.
        return [x.strip() for x in re.split(r'(\W+)', sent) if x.strip()]

    def parse_stories(lines, only_supporting=False):
        '''Parse stories provided in the bAbI tasks format
        If only_supporting is true, only the sentences that support the answer are kept.
        '''
        data = []
        story = []
        for line in lines:
            line = str.lower(line)
            nid, line = line.split(' ', 1)
            nid = int(nid)
            # nid is the line number within a story (1 to 15). A value of 1
            # marks the start of a new story, so the story list is reset.
            if nid == 1:
                story = []
            if '\t' in line: # a tab means this is a question line
                q, a, supporting = line.split('\t')
                q = tokenize(q)
                #a = tokenize(a)
                # answer is one vocab word even if it's actually multiple words
                a = [a]
                substory = None

                # remove question marks
                if q[-1] == "?":
                    q = q[:-1]

                # Decide whether to keep only the supporting lines, i.e. the
                # numbers mentioned above. This model does not need them.
                if only_supporting:
                    # Only select the related substory
                    supporting = map(int, supporting.split())
                    substory = [story[i - 1] for i in supporting]
                else:
                    # Provide all the substories
                    substory = [x for x in story if x]

                data.append((substory, q, a))
                story.append('')
            else: # regular sentence
                # remove periods
                sent = tokenize(line)
                if sent[-1] == ".":
                    sent = sent[:-1]
                story.append(sent)
        return data

    def get_stories(f, only_supporting=False):
        '''Given a file name, read the file, retrieve the stories, and then convert the sentences into a single story.
        If max_length is supplied, any stories longer than max_length tokens will be discarded.
        '''
        with open(f) as f:
            return parse_stories(f.readlines(), only_supporting=only_supporting)

    def vectorize_data(data, word_idx, sentence_size, memory_size):
        """
        Vectorize stories and queries.

        If a sentence length < sentence_size, the sentence will be padded with 0's.

        If a story length < memory_size, the story will be padded with empty memories.
        Empty memories are 1-D arrays of length sentence_size filled with 0's.

        The answer array is returned as a one-hot encoding.
        """
        # Convert words to their vocab indices for the embedding lookup.
        S = []
        Q = []
        A = []
        for story, query, answer in data:
            ss = []
            for i, sentence in enumerate(story, 1):
                ls = max(0, sentence_size - len(sentence))
                ss.append([word_idx[w] for w in sentence] + [0] * ls)

            # take only the most recent sentences that fit in memory
            ss = ss[::-1][:memory_size][::-1]

            # Make the last word of each sentence the time 'word' which
            # corresponds to vector of lookup table
            for i in range(len(ss)):
                ss[i][-1] = len(word_idx) - memory_size - i + len(ss)

            # pad to memory_size
            lm = max(0, memory_size - len(ss))
            for _ in range(lm):
                ss.append([0] * sentence_size)

            lq = max(0, sentence_size - len(query))
            q = [word_idx[w] for w in query] + [0] * lq

            y = np.zeros(len(word_idx) + 1) # 0 is reserved for nil word
            for a in answer:
                y[word_idx[a]] = 1

            S.append(ss)
            Q.append(q)
            A.append(y)
        return np.array(S), np.array(Q), np.array(A)

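After processing we have the training samples, and the next step is to build and train the model. In the model construction code we focus on position_encoding and on inference; the rest is omitted to keep the post short and can be read in the repository. Here is the implementation of position encoding (PE):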

    import numpy as np


    def position_encoding(sentence_size, embedding_size):
        """
        Position Encoding described in section 4.1 [1]
        """
        encoding = np.ones((embedding_size, sentence_size), dtype=np.float32)
        ls = sentence_size+1
        le = embedding_size+1
        for i in range(1, le):
            for j in range(1, ls):
                # Value of each element of the L matrix: i and j are each
                # offset by half of the row and column counts. This does not
                # seem to match the formula in the paper exactly, but it
                # still encodes position information.
                encoding[i-1, j-1] = (i - (embedding_size+1)/2) * (j - (sentence_size+1)/2)
        encoding = 1 + 4 * encoding / embedding_size / sentence_size
        # Make position encoding of time words identity to avoid modifying them
        encoding[:, -1] = 1.0
        return np.transpose(encoding)

The inference part of the model is shown below:

    import tensorflow as tf  # TensorFlow 1.x API (tf.placeholder, tf.variable_scope)


    # The functions below are methods of the model class.
    def _build_inputs(self):
        self._stories = tf.placeholder(tf.int32, [None, self._memory_size, self._sentence_size], name='stories')
        self._queries = tf.placeholder(tf.int32, [None, self._sentence_size], name='queries')
        self._answers = tf.placeholder(tf.int32, [None, self._vocab_size], name='answers')
        self._lr = tf.placeholder(tf.float32, [], name='learning_rate')

    def _build_vars(self):
        with tf.variable_scope(self._name):
            nil_word_slot = tf.zeros([1, self._embedding_size])
            # concat prepends an all-zero word vector for the nil word
            A = tf.concat(axis=0, values=[nil_word_slot, self._init([self._vocab_size-1, self._embedding_size])])
            C = tf.concat(axis=0, values=[nil_word_slot, self._init([self._vocab_size-1, self._embedding_size])])
            # Adjacent weight tying: B = A_1 and A_{k+1} = C_k, so there is no
            # need to define one A per hop; the corresponding C is reused.
            self.A_1 = tf.Variable(A, name='A')

            self.C = []

            for hopn in range(self._hops):
                with tf.variable_scope('hop_{}'.format(hopn)):
                    self.C.append(tf.Variable(C, name='C'))

        self._nil_vars = set([self.A_1.name] + [x.name for x in self.C])

    def _inference(self, stories, queries):
        with tf.variable_scope(self._name):
            # Embed the queries, using A_1 in place of B
            q_emb = tf.nn.embedding_lookup(self.A_1, queries)
            # Position encoding: weight the word embeddings of a sentence
            # and sum them to get the sentence representation
            u_0 = tf.reduce_sum(q_emb*self._encoding, axis=1)
            u = [u_0]

            for hopn in range(self._hops):
                if hopn == 0: # the first hop embeds with A_1, later hops with C[hopn-1]
                    m_emb_A = tf.nn.embedding_lookup(self.A_1, stories)
                    m_A = tf.reduce_sum(m_emb_A*self._encoding, axis=2)
                else:
                    with tf.variable_scope('hop_{}'.format(hopn - 1)):
                        m_emb_A = tf.nn.embedding_lookup(self.C[hopn-1], stories)
                        m_A = tf.reduce_sum(m_emb_A*self._encoding, axis=2)

                # Take the previous hop's output as u and add a dimension:
                # [batch_size, 1, embedding_size]
                u_temp = tf.transpose(tf.expand_dims(u[-1], -1), [0, 2, 1])
                # m_A has shape [batch_size, memory_size, embedding_size]
                dotted = tf.reduce_sum(m_A * u_temp, 2) # [batch_size, memory_size]

                # Weight of the query on each story sentence in memory;
                # after the softmax it can be read as a probability
                probs = tf.nn.softmax(dotted)

                # Add a dimension and transpose: [batch_size, 1, memory_size]
                probs_temp = tf.transpose(tf.expand_dims(probs, -1), [0, 2, 1])
                with tf.variable_scope('hop_{}'.format(hopn)):
                    m_emb_C = tf.nn.embedding_lookup(self.C[hopn], stories)
                    m_C = tf.reduce_sum(m_emb_C*self._encoding, axis=2)

                # [batch_size, embedding_size, memory_size]
                c_temp = tf.transpose(m_C, [0, 2, 1])
                o_k = tf.reduce_sum(c_temp * probs_temp, axis=2) # [batch_size, embedding_size]

                u_k = u[-1] + o_k
                u.append(u_k)

            with tf.variable_scope('hop_{}'.format(self._hops)):
                return tf.matmul(u_k, tf.transpose(self.C[-1], [1, 0]))

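Training the model is then straightforward. The figure below shows the training result on task 1: for this simple task the model answers the questions fully correctly. On more complex tasks such as task 2, however, the accuracy still leaves room for improvement: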

PTB language modeling

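The code for this part is at https://github.com/carpedm20/MemN2N-tensorflow. This is a traditional language modeling task: given a sequence of words, predict the probability of the next word. The dataset is PTB (the training, validation and test sets contain 929K, 73K and 82K words respectively, with a vocabulary of 10,000 words). To fit this task, the model needs the following changes: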

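The input is now at word level rather than the sentence level of the QA task, so no sentence encoding is needed. The word vector of each word is stored directly in memory, that is, m_i is now a single word's vector, and the position encoding step is dropped.
The output is the next word given the word sequence, that is, a probability for each word in the vocabulary. This part needs no change.
There is no question here, or rather every training example has the same question, so the question vector is set to a constant vector of 0.1 and no embedding is applied to it.
Language modeling has traditionally been done with LSTMs, so to make the model more RNN-like we use the second weight tying scheme: A and C are the same in every layer, and the matrix H applies a linear mapping to the input of each layer.
The paper applies a ReLU non-linearity to half of the units in each hop.
A deeper model is used, with 6 or 7 hops, and a larger memory of size 100.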
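The changes listed above can be sketched in NumPy as a single forward pass. This is an illustrative toy, not the referenced implementation: the variable names `nwords`, `edim`, `mem_size`, `nhop` and `lin_dim` are borrowed loosely from the excerpt below, the sizes are deliberately small, and the helper `lm_hop` is hypothetical.

```python
import numpy as np


def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()


def lm_hop(u, m, c, H, lin_dim):
    """One hop with layer-wise tying: u_next = H-mapped u + o, ReLU on part of the units."""
    p = softmax(m @ u)        # attention over the words in memory
    o = p @ c                 # weighted sum of the output memories
    out = u @ H + o           # linear mapping of u plus o (row-vector convention)
    out[lin_dim:] = np.maximum(out[lin_dim:], 0.0)  # ReLU on the non-linear units only
    return out


rng = np.random.default_rng(0)
nwords, edim, mem_size, nhop, lin_dim = 50, 6, 8, 6, 3  # toy sizes; lin_dim = edim / 2

A = rng.normal(scale=0.05, size=(nwords, edim))     # shared by every hop
C = rng.normal(scale=0.05, size=(nwords, edim))     # shared by every hop
T_A = rng.normal(scale=0.05, size=(mem_size, edim))
T_C = rng.normal(scale=0.05, size=(mem_size, edim))
H = rng.normal(scale=0.05, size=(edim, edim))
W = rng.normal(scale=0.05, size=(edim, nwords))

context = rng.integers(0, nwords, size=mem_size)    # ids of the previous words
time = np.arange(mem_size)

m = A[context] + T_A[time]   # each memory slot is one word vector plus its temporal term
c = C[context] + T_C[time]

u = np.full(edim, 0.1)       # constant "question" vector, no embedding
for _ in range(nhop):
    u = lm_hop(u, m, c, H, lin_dim)

next_word_probs = softmax(u @ W)
print(next_word_probs.shape)
```

Keeping half of the units linear leaves a direct path for information from one hop to the next, while the ReLU half adds the non-linearity.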

With these changes in place, the Memory Network can be applied to language modeling. The data here is simple, so we skip the data processing code and go straight to model construction:

        # TensorFlow 1.x code; this excerpt starts inside the model's __init__.
        # Placeholders for the model inputs. input corresponds to the question
        # and is later filled with a constant 0.1 vector. time holds the
        # temporal information and is later filled in order; its shape is
        # batch_size * mem_size because it gives the time index of each memory
        # slot. target is the next word, i.e. what we want to predict. context
        # is the preceding text, i.e. what gets stored in memory.
        self.input = tf.placeholder(tf.float32, [None, self.edim], name="input")
        self.time = tf.placeholder(tf.int32, [None, self.mem_size], name="time")
        self.target = tf.placeholder(tf.float32, [self.batch_size, self.nwords], name="target")
        self.context = tf.placeholder(tf.int32, [self.batch_size, self.mem_size], name="context")

        # There are several hops, so these lists keep the results of each layer
        self.hid = []
        # For the first layer the input is just input, so it is appended to
        # hid directly for use in the loop below
        self.hid.append(self.input)
        self.share_list = []
        self.share_list.append([])

        self.lr = None
        self.current_lr = config.init_lr
        self.loss = None
        self.step = None
        self.optim = None

        self.sess = sess
        self.log_loss = []
        self.log_perp = []

    def build_memory(self):
        self.global_step = tf.Variable(0, name="global_step")

        # Variables: A is the paper's A, B is the paper's C, and C is the
        # paper's H matrix. The author did not follow the paper's naming.
        self.A = tf.Variable(tf.random_normal([self.nwords, self.edim], stddev=self.init_std))
        self.B = tf.Variable(tf.random_normal([self.nwords, self.edim], stddev=self.init_std))
        self.C = tf.Variable(tf.random_normal([self.edim, self.edim], stddev=self.init_std))

        # Temporal Encoding matrices: T_A is the paper's T_A, T_B is the paper's T_C
        self.T_A = tf.Variable(tf.random_normal([self.mem_size, self.edim], stddev=self.init_std))
        self.T_B = tf.Variable(tf.random_normal([self.mem_size, self.edim], stddev=self.init_std))

        # The next two blocks encode the context into memory together with
        # the temporal information
        # m_i = sum A_ij * x_ij + T_A_i
        Ain_c = tf.nn.embedding_lookup(self.A, self.context)
        Ain_t = tf.nn.embedding_lookup(self.T_A, self.time)
        Ain = tf.add(Ain_c, Ain_t)

        # c_i = sum B_ij * u + T_B_i
        Bin_c = tf.nn.embedding_lookup(self.B, self.context)
        Bin_t = tf.nn.embedding_lookup(self.T_B, self.time)
        Bin = tf.add(Bin_c, Bin_t)

        # For each hop, do the following
        for h in range(self.nhop):
            # Take the previous layer's output from hid
            self.hid3dim = tf.reshape(self.hid[-1], [-1, 1, self.edim])
            # The next three lines compute the relevance scores P of the
            # memory with respect to the question; note the shape changes
            Aout = tf.matmul(self.hid3dim, Ain, adjoint_b=True)
            Aout2dim = tf.reshape(Aout, [-1, self.mem_size])
            P = tf.nn.softmax(Aout2dim)

            # Weighted sum of the output memory using P
            probs3dim = tf.reshape(P, [-1, 1, self.mem_size])
            Bout = tf.matmul(probs3dim, Bin)
            Bout2dim = tf.reshape(Bout, [-1, self.edim])

            # As in the paper, map q linearly with the H matrix and add o to
            # get the output of this layer
            Cout = tf.matmul(self.hid[-1], self.C)
            Dout = tf.add(Cout, Bout2dim)

            # Keep the intermediate values of each layer for inspection
            self.share_list[0].append(Cout)

            # Apply ReLU to some of the units, depending on the configuration
            if self.lindim == self.edim:
                self.hid.append(Dout)
            elif self.lindim == 0:
                self.hid.append(tf.nn.relu(Dout))
            else:
                F = tf.slice(Dout, [0, 0], [self.batch_size, self.lindim])
                G = tf.slice(Dout, [0, self.lindim], [self.batch_size, self.edim-self.lindim])
                K = tf.nn.relu(G)
                self.hid.append(tf.concat(axis=1, values=[F, K]))

    def build_model(self):
        self.build_memory()

        # Output layer: use the last entry of hid to produce the answer
        self.W = tf.Variable(tf.random_normal([self.edim, self.nwords], stddev=self.init_std))
        z = tf.matmul(self.hid[-1], self.W)

        # Cross-entropy loss
        self.loss = tf.nn.softmax_cross_entropy_with_logits(logits=z, labels=self.target)

        self.lr = tf.Variable(self.current_lr)
        self.opt = tf.train.GradientDescentOptimizer(self.lr)

        # Gradient clipping: gradients whose norm exceeds the limit are clipped
        params = [self.A, self.B, self.C, self.T_A, self.T_B, self.W]
        grads_and_vars = self.opt.compute_gradients(self.loss,params)
        clipped_grads_and_vars = [(tf.clip_by_norm(gv[0], self.max_grad_norm), gv[1]) \
                                   for gv in grads_and_vars]

        inc = self.global_step.assign_add(1)
        with tf.control_dependencies([inc]):
            self.optim = self.opt.apply_gradients(clipped_grads_and_vars)

        tf.global_variables_initializer().run()
        self.saver = tf.train.Saver()

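The training result is shown below. Training is slow, so only the first 30 epochs were captured; the validation perplexity has already dropped to a little over 130, which is a reasonable result.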
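For readers unfamiliar with the metric, perplexity is the exponential of the average per-word cross-entropy. The sketch below is a generic NumPy illustration of that definition with made-up inputs; it is not taken from the referenced repository, and the function name `perplexity` is an assumption.

```python
import numpy as np


def perplexity(probs, targets):
    """probs: (n, V) predicted distributions, targets: (n,) ids of the true next words."""
    nll = -np.log(probs[np.arange(len(targets)), targets])  # per-word cross-entropy
    return float(np.exp(nll.mean()))


vocab = 10000
# A model that knows nothing assigns the same probability to every word.
uniform = np.full((5, vocab), 1.0 / vocab)
targets = np.array([1, 2, 3, 4, 5])

ppl = perplexity(uniform, targets)
print(ppl)
```

A uniform guess over a 10,000-word vocabulary has a perplexity of about 10,000, so a value a little over 130 means the model is, on average, about as uncertain as if it were choosing among roughly 130 equally likely words.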

Summary

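After these two experiments you should have a deeper understanding of Memory Networks and a sense of how to apply them to different NLP tasks. Compared with the Memory Networks of the previous paper, the End-To-End Memory Networks proposed in this paper need less supervision, as seen from the fact that the training data no longer has to say which sentence the answer depends on. The model learns by itself where the input most relevant to the question is. A figure from the paper helps here: it shows that every hop learns to attend to the input relevant to the question. For simple questions, all three hops find the single most relevant sentence; for more complex questions (which may depend on several sentences), each hop attends to its own relevant input, and the hops together show a chain of reasoning. This suggests that the external memory works well.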
