attention model


How did I select papers?

First, I tried searching for “attention” in CVPR 2014-2016, ICCV 2009-2015 and ACMMM 2012-2015. However, only a few papers contain this keyword.
Then, I searched for “attention model” on Google and found blogs that discuss it and list some papers.

Attention

[1] talks about “attention” and why we need it: when people see a picture, they usually move their eyes around over time and gather information about the scene. They do not see every pixel of the image at once. They attend to certain aspects of the picture one time step at a time and aggregate the information. That is exactly the kind of power we want to give to our neural network models. The usual convolutional network can recognize cluttered images, but how do we find the exact set of weights that are “good”? That is a difficult task. By providing the network with an architecture-level feature that allows it to attend to different parts of the image sequentially and aggregate information over time, we make that job easier, because now the network can simply learn to ignore the clutter (or so is the hope).
In natural language processing, a typical task is natural language generation: given a context, generate a target (a relevant sentence). Machine translation is one instance. When using deep learning to solve this task, a common method is the encoder-decoder framework.
Given a sequence of words, [3] uses an RNN encoder to produce a context vector (the last hidden state of the RNN); an RNN decoder then uses this hidden state as its initial state to generate words one by one.
Fig 1. The encoder-decoder framework of [3]: the encoder’s last hidden state serves as the context vector that initializes the decoder.
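A minimal NumPy sketch of this setup (the vanilla RNN cell, toy dimensions, and random weights here are illustrative assumptions; [3] actually uses deep LSTMs and an output softmax):

```python
import numpy as np

def rnn_step(x, h, Wxh, Whh, b):
    """One vanilla RNN step: new hidden state from input x and previous state h."""
    return np.tanh(Wxh @ x + Whh @ h + b)

# Toy dimensions (assumptions for illustration only).
d_in, d_h = 8, 16
rng = np.random.default_rng(0)
Wxh = rng.normal(scale=0.1, size=(d_h, d_in))
Whh = rng.normal(scale=0.1, size=(d_h, d_h))
b = np.zeros(d_h)

# Encoder: read the whole source sequence, keep only the last hidden state.
source = [rng.normal(size=d_in) for _ in range(5)]   # 5 embedded source tokens
h = np.zeros(d_h)
for x in source:
    h = rnn_step(x, h, Wxh, Whh, b)
context = h   # fixed-size context vector, regardless of source length

# Decoder: start from the context vector and roll the state forward step by step
# (a real decoder would also feed back the previously generated word and
# apply an output softmax to predict each target word).
dec_h = context
for _ in range(3):
    dec_h = rnn_step(np.zeros(d_in), dec_h, Wxh, Whh, b)
```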
However, no matter how long the input sequence is, the output of the encoder is a single vector whose dimension is only a few hundred, which means that the longer the input sequence is, the more information the context vector loses.
In fact, the decoder can use all of the information in the input sequence instead of just the last state.
Fig 2. The model of [2]: the hypothesis states (h7, h8, h9) are generated with access to all the input states (h1, …, h5), not just the last state (h6).
In [2], when generating the hypothesis states (h7, h8, h9), all the input vectors (h1, …, h5) are fed in, rather than only the last state (h6). Moreover, not all input vectors should influence the generation of the next state equally. For example, consider translating “私は猫が好きです。” into “I like cats”. To generate the word “like”, we should focus on the input word “好き” rather than on the other words. “Attention” means selecting the proper input vectors and using them to generate the next target state.
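A rough sketch of this “select the proper input vectors” idea, assuming a simple dot-product relevance score (the scoring function and dimensions are illustrative, not taken from [2]):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attend(decoder_state, encoder_states):
    """Weight each input vector by how relevant it is to the current decoder state."""
    scores = np.array([decoder_state @ h for h in encoder_states])  # dot-product relevance
    weights = softmax(scores)                                       # sum to 1
    context = sum(w * h for w, h in zip(weights, encoder_states))   # weighted average
    return context, weights

rng = np.random.default_rng(1)
enc = [rng.normal(size=16) for _ in range(5)]   # h1..h5 for the source words
dec = rng.normal(size=16)                       # state used while generating "like"
ctx, w = attend(dec, enc)
# Ideally, w puts most of its mass on the position of "好き".
```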

Soft-attention and Hard-attention

Papers

Translation

Effective approaches to attention-based neural machine translation [4]

The attention-based models of [4] are classified into two broad categories, global and local. These classes differ in terms of whether the “attention” is placed on all source positions or on only a few source positions.
Common to these two types of models is the fact that at each time step t in the decoding phase, both approaches first take as input the hidden state ht at the top layer of a stacking LSTM. The goal is then to derive a context vector ct that captures relevant source-side information to help predict the current target word yt. While these models differ in how the context vector ct is derived, they share the same subsequent steps.
Fig 3. Neural machine translation – a stacking recurrent architecture for translating a source sequence A B C D into a target sequence X Y Z. Here, <eos> marks the end of a sentence.
Fig 4. Global attentional model – at each time step t, the model infers a variable-length alignment weight vector at based on the current target state ht and all source states hs. A global context vector ct is then computed as the weighted average, according to at, over all the source states.
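A small sketch of one decoding step of global attention, using the “dot” score from [4] and the attentional hidden state h̃t = tanh(Wc[ct; ht]); the dimensions and random weights are placeholders, not the paper’s setup:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def global_attention(h_t, source_states, W_c):
    """Global attention at one decoding step (using the 'dot' score of [4])."""
    scores = source_states @ h_t             # score(h_t, h_s) = h_t . h_s, one per source position
    a_t = softmax(scores)                    # alignment weights over ALL source positions
    c_t = a_t @ source_states                # context vector: weighted average of source states
    h_tilde = np.tanh(W_c @ np.concatenate([c_t, h_t]))  # attentional hidden state
    return h_tilde, a_t

d = 16
rng = np.random.default_rng(2)
H_s = rng.normal(size=(7, d))                # 7 source hidden states
h_t = rng.normal(size=d)                     # current target hidden state
W_c = rng.normal(scale=0.1, size=(d, 2 * d))
h_tilde, a_t = global_attention(h_t, H_s, W_c)
```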
Fig 5. Local attention model – the model first predicts a single aligned position pt for the current target word. A window centered around the source position pt is then used to compute a context vector ct, a weighted average of the source hidden states in the window. The weights at are inferred from the current target state ht and those source states hs in the window.
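A rough sketch of the predictive local attention (“local-p”) variant of [4]: predict pt, attend only within a window of width 2D+1, and favour positions near pt with a Gaussian (σ = D/2). The dimensions, weights, and window handling are simplified placeholders:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def local_attention(h_t, source_states, W_p, v_p, D=2):
    """Local attention, roughly the 'local-p' variant of [4]."""
    S = len(source_states)
    # Predict the aligned source position p_t in [0, S] via a small network + sigmoid.
    p_t = S * 1.0 / (1.0 + np.exp(-(v_p @ np.tanh(W_p @ h_t))))
    # Attend only to a window of width 2D+1 around p_t.
    lo, hi = max(0, int(p_t) - D), min(S, int(p_t) + D + 1)
    window = source_states[lo:hi]
    scores = window @ h_t                    # dot score within the window
    a_t = softmax(scores)
    # Favour positions near p_t with a Gaussian (sigma = D / 2).
    positions = np.arange(lo, hi)
    a_t = a_t * np.exp(-((positions - p_t) ** 2) / (2 * (D / 2.0) ** 2))
    c_t = a_t @ window                       # context vector from the window only
    return c_t, a_t, p_t

d = 16
rng = np.random.default_rng(3)
H_s = rng.normal(size=(9, d))
h_t = rng.normal(size=d)
W_p, v_p = rng.normal(scale=0.1, size=(d, d)), rng.normal(scale=0.1, size=d)
c_t, a_t, p_t = local_attention(h_t, H_s, W_p, v_p)
```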

Neural machine translation by jointly learning to align and translate [5]

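[5] computes a separate context vector for every target word, scoring each bidirectional encoder annotation hj against the previous decoder state with an additive (“concat”) function. A rough sketch of that scoring step, with placeholder dimensions and random weights:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def additive_attention(s_prev, annotations, W_a, U_a, v_a):
    """Additive ('concat') attention in the spirit of [5]: score each source
    annotation h_j against the previous decoder state s_{t-1}."""
    e = np.array([v_a @ np.tanh(W_a @ s_prev + U_a @ h_j) for h_j in annotations])
    alpha = softmax(e)                      # soft alignment over source positions
    c_t = alpha @ annotations               # context vector fed to the decoder at step t
    return c_t, alpha

d_s, d_h, d_a = 16, 16, 8
rng = np.random.default_rng(4)
annotations = rng.normal(size=(6, d_h))     # encoder annotations h_1..h_6
s_prev = rng.normal(size=d_s)               # previous decoder state
W_a = rng.normal(scale=0.1, size=(d_a, d_s))
U_a = rng.normal(scale=0.1, size=(d_a, d_h))
v_a = rng.normal(scale=0.1, size=d_a)
c_t, alpha = additive_attention(s_prev, annotations, W_a, U_a, v_a)
```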

Reference

[1] http://stackoverflow.com/questions/35549588/soft-attention-vs-hard-attention
[2] Rocktäschel, T., Grefenstette, E., Hermann, K. M., Kočiský, T., & Blunsom, P. (2015). Reasoning about entailment with neural attention. arXiv preprint arXiv:1509.06664.
[3] Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural networks. In Advances in neural information processing systems (pp. 3104-3112).
[4] Luong, M. T., Pham, H., & Manning, C. D. (2015). Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025.
[5] Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

[*] https://www.zhihu.com/question/36591394

Yao, L., Torabi, A., Cho, K., Ballas, N., Pal, C., Larochelle, H., & Courville, A. (2015). Video description generation incorporating spatio-temporal features and a soft-attention mechanism. arXiv preprint arXiv:1502.08029.
