Named Entity Recognition


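Problems such as Chinese word segmentation, part-of-speech tagging, and named entity recognition are all sequence labeling problems. The classic models are HMM, MEMM, and CRF. These are the traditional methods, and each has strengths and weaknesses. The HMM assumes that observations are independent of one another, so it cannot use features that span the observation sequence. The MEMM adds transition features over the observation sequence, but its local normalization introduces the label bias problem. The CRF normalizes globally, which makes up for the shortcomings of both HMM and MEMM, at the price of a higher computational cost.
With the rise of deep learning, DNN models have been applied to sequence labeling with strong results. Comparing the models: before DNNs, CRFs generally gave the best results and were the most widely used in practice; since DNNs arrived, the state-of-the-art results have all been taken by DNN variants. DNNs are strong at learning and representing features. The main idea of this family of methods is to let a DNN learn the features in place of the feature engineering of a traditional CRF, combining the strengths of both.
The following describes several common DNN+CRF models as applied to named entity recognition.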

1 Literature survey

| Paper | Date | Method | CoNLL-2003 English F1 |
| --- | --- | --- | --- |
| [1] Neural Architectures for Named Entity Recognition | 2016.4 | BiLSTM+CRF+character | 90.94 |
| [2] End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF | 2016.5 | BiLSTM+CNN+CRF+character | 91.21 |
| [3] LSTM-Based NeuroCRFs for Named Entity Recognition | 2016.9 | LSTM/BiLSTM+CRF | 89.30/89.23 |
| [4] Attending to Characters in Neural Sequence Labeling Models | 2016.11 | LSTM/BiLSTM+CRF | 84.09 |
| [5] Fast and Accurate Entity Recognition with Iterated Dilated Convolutions | 2017.7 | IDCNN+CRF | 90.54 ±0.18 |
| [6] Named Entity Recognition with Gated Convolutional Neural Networks | 2017 | GCNN+CRF | 91.24 |

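The differences between these results come partly from different model structures and input data, and partly from different hyperparameter settings, such as the embedding dimension and the dimension of the LSTM hidden state $h_t$. The numbers above are for reference only and should not be treated as an absolute standard.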

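2 Named entity recognition models

The models above can all be summarized by the structure below. They differ mainly in whether a CNN or an RNN (LSTM, Bi-LSTM) is used, and in the input data (word-embedding, character-representation).

2.1 Basic model structure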

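Given an observation sequence

$X = (x_1, x_2, \dots, x_n),$

let $P \in \mathbb{R}^{n \times k}$ be the matrix of scores output by the DNN, where $k$ is the number of tags to be assigned and $P_{i,j}$ is the score of the $j$-th tag for the $i$-th word in the sequence. For a predicted tag sequence

$y = (y_1, y_2, \dots, y_n),$

the score of the sequence is defined as

$s(X, y) = \sum_{i=0}^{n} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i},$

where $A$ is the transition matrix and $A_{i,j}$ is the score of a transition from tag $i$ to tag $j$ (an unnormalized score rather than a probability). $y_0$ and $y_{n+1}$ are the start and end tags added around the input sequence, so $A \in \mathbb{R}^{(k+2) \times (k+2)}$.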
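A softmax over the scores of all possible tag sequences gives the probability of each sequence:

$p(y \mid X) = \frac{e^{s(X, y)}}{\sum_{\tilde{y} \in Y_X} e^{s(X, \tilde{y})}}$

The tag sequence that maximizes $p(y \mid X)$ is the final prediction. Training maximizes the log-probability of the correct tag sequence:

$\log p(y \mid X) = s(X, y) - \log \left( \sum_{\tilde{y} \in Y_X} e^{s(X, \tilde{y})} \right)$

where $Y_X$ is the set of all possible tag sequences for the sequence $X$. Finally, decoding returns

$y^* = \arg\max_{\tilde{y} \in Y_X} s(X, \tilde{y})$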
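
The score, the log-probability, and the decoding step defined above can be sketched in plain Python. This is a minimal illustration and not code from any of the cited papers: the function names, the toy values of `P` and `A`, and the convention that the start and end tags take indices `k` and `k + 1` are all assumptions made for the example. The normalizer is computed with the standard forward recursion instead of enumerating every sequence in $Y_X$, and the arg max is computed with the Viterbi algorithm.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def sequence_score(P, A, y):
    """s(X, y): transition scores (including start/end) plus emission scores."""
    k = len(P[0])
    start, end = k, k + 1            # assumed indices of the two extra tags
    path = [start] + list(y) + [end]
    transitions = sum(A[a][b] for a, b in zip(path, path[1:]))
    emissions = sum(P[i][tag] for i, tag in enumerate(y))
    return transitions + emissions

def log_partition(P, A):
    """log of the sum of exp(s(X, y)) over all tag sequences (forward recursion)."""
    k = len(P[0])
    start, end = k, k + 1
    alpha = [A[start][j] + P[0][j] for j in range(k)]
    for i in range(1, len(P)):
        alpha = [logsumexp([alpha[a] + A[a][j] for a in range(k)]) + P[i][j]
                 for j in range(k)]
    return logsumexp([alpha[j] + A[j][end] for j in range(k)])

def viterbi(P, A):
    """Return the highest-scoring tag sequence y* and its score."""
    k = len(P[0])
    start, end = k, k + 1
    best = [A[start][j] + P[0][j] for j in range(k)]
    back = []                        # back[i-1][j]: best previous tag for tag j at step i
    for i in range(1, len(P)):
        pointers, new_best = [], []
        for j in range(k):
            a = max(range(k), key=lambda t: best[t] + A[t][j])
            pointers.append(a)
            new_best.append(best[a] + A[a][j] + P[i][j])
        back.append(pointers)
        best = new_best
    last = max(range(k), key=lambda t: best[t] + A[t][end])
    score = best[last] + A[last][end]
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1], score

# Toy example: n = 3 words, k = 2 tags, so A is 4 x 4 (rows/cols 2 and 3 are start/end).
P = [[1.0, 0.2], [0.3, 1.5], [0.8, 0.1]]
A = [[0.5, -0.2, 0.0, 0.1],
     [-0.3, 0.4, 0.0, 0.2],
     [0.1, 0.0, 0.0, 0.0],
     [0.0, 0.0, 0.0, 0.0]]
y = [0, 1, 0]
log_prob = sequence_score(P, A, y) - log_partition(P, A)   # log p(y | X)
best_path, best_score = viterbi(P, A)
```

During training, `-log_prob` for the gold tag sequence would serve as the loss; at test time only `viterbi` is needed, because the normalizer does not change which sequence scores highest.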

2.2 Several specific networks

2.2.1 RNN networks (Bi-LSTM)

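Paper [1] takes as input a word-embedding $d_t$ and a character-embedding $h_t$ (written $h$ below). For a word $w_t$ with character sequence $(c_1, c_2, \cdots, c_R)$, a Bi-LSTM produces the character-embedding: $\overrightarrow{h_i} = \mathrm{LSTM}(c_i, \overrightarrow{h_{i-1}}),\ \overleftarrow{h_i} = \mathrm{LSTM}(c_i, \overleftarrow{h_{i+1}}),\ h = [\overrightarrow{h_R}, \overleftarrow{h_1}]$
The character-embedding is concatenated to the word-embedding to form the feature vector of each word, which is fed into a Bi-LSTM, and the output of that Bi-LSTM is fed into the CRF layer. The character-embedding itself is the concatenation of the final states of the forward LSTM and the backward LSTM, so that
$x_t = \mathrm{concat}(d_t, h)$
The reason is that the representation an LSTM produces for $w_t$ mostly reflects the inputs it read last (they have a representation biased towards their most recent inputs). When a Bi-LSTM extracts the character-level representation, the final state of the forward LSTM therefore represents the suffix of $w_t$ well, and the final state of the backward LSTM represents its prefix well.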
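
The character-level construction above can be sketched as follows. This is only an illustration under stated assumptions: the LSTM cell is a standard textbook cell written in plain Python, the weights are random and untrained, and every name (`make_lstm`, `lstm_step`, `char_representation`) and every dimension is hypothetical and not taken from paper [1]. The point is the data flow: the forward pass ends on the last character, the backward pass ends on the first character, and the two final states are concatenated with the word-embedding.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def make_lstm(input_dim, hidden_dim, rng):
    """Random (untrained) parameters: one weight matrix and bias per gate i, f, o, g."""
    def matrix(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    return {name: (matrix(hidden_dim, input_dim + hidden_dim), [0.0] * hidden_dim)
            for name in "ifog"}

def lstm_step(params, x, h, c):
    """One LSTM step on input x, previous hidden state h and cell state c."""
    xh = x + h  # list concatenation of the input and the previous hidden state
    def gate(name, act):
        W, b = params[name]
        return [act(sum(w * v for w, v in zip(row, xh)) + bias)
                for row, bias in zip(W, b)]
    i, f, o = gate("i", sigmoid), gate("f", sigmoid), gate("o", sigmoid)
    g = gate("g", math.tanh)
    c_new = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
    h_new = [oj * math.tanh(cj) for oj, cj in zip(o, c_new)]
    return h_new, c_new

def final_state(params, inputs, hidden_dim):
    """Run the LSTM over the inputs and return the last hidden state."""
    h, c = [0.0] * hidden_dim, [0.0] * hidden_dim
    for x in inputs:
        h, c = lstm_step(params, x, h, c)
    return h

def char_representation(word, char_emb, fwd, bwd, hidden_dim):
    chars = [char_emb[ch] for ch in word]
    h_fwd = final_state(fwd, chars, hidden_dim)        # reads c_1..c_R, ends on the suffix
    h_bwd = final_state(bwd, chars[::-1], hidden_dim)  # reads c_R..c_1, ends on the prefix
    return h_fwd + h_bwd                               # h = [h_R forward, h_1 backward]

rng = random.Random(0)
char_dim, hidden_dim, word_dim = 4, 3, 5
char_emb = {ch: [rng.uniform(-0.5, 0.5) for _ in range(char_dim)]
            for ch in "abcdefghijklmnopqrstuvwxyz"}
fwd = make_lstm(char_dim, hidden_dim, rng)
bwd = make_lstm(char_dim, hidden_dim, rng)

d_t = [rng.uniform(-0.5, 0.5) for _ in range(word_dim)]  # stand-in word-embedding
h = char_representation("cabinets", char_emb, fwd, bwd, hidden_dim)
x_t = d_t + h                                            # x_t = concat(d_t, h)
```

In a real model the two LSTMs and the character table would be trained jointly with the word-level Bi-LSTM and the CRF layer.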
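Paper [3] takes as input a word-embedding together with a pos-embedding (5-dim) and a capitalization-embedding (5-dim), i.e. a case indicator.
Paper [4] also takes a word-embedding and a character-embedding as input, but combines them differently from paper [1]. First, a non-linear transformation is applied to $h$:
$\hat{h} = \tanh(W_t h),$
The rationale is that the non-linear transformation, which amounts to one extra hidden layer, can extract higher-level features (An additional hidden layer allows the model to detect higher-level feature combinations, while constraining it to be small forces it to focus on more generalisable patterns). (I do not fully understand this.) Second, the word-embedding $d_t$ and the character-embedding $\hat{h}$ are combined with learned weights:
$z = \sigma\left(W_z^{(3)} \tanh\left(W_z^{(1)} d_t + W_z^{(2)} \hat{h}\right)\right), \quad \tilde{x} = z \cdot d_t + (1 - z) \cdot \hat{h}$
The advantage of this approach is that the network can decide dynamically how much to take from the word-embedding and how much from the character-embedding. For example, words with regular suffixes can share some character-level features, whereas irregular words can store exceptions into word embeddings.
To align $d_t$ and $\hat{h}$, the authors also add a term to the objective function:
$\tilde{E} = E + \sum_{t=1}^{T} g_t \left(1 - \cos(\hat{h}_t, d_t)\right), \quad g_t = \begin{cases} 0 & \text{if } w_t = \mathrm{OOV} \\ 1 & \text{otherwise} \end{cases}$
where $\hat{h}_t$ is $\hat{h}$ computed for the word $w_t$.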
Finally, here is a passage from the paper that I found instructive:

While the character component learns general regularities that are shared between all the words, individual word embeddings provide a way for the model to store word-specific information and any exceptions. Therefore, while we want the character-based model to shift towards predicting high-quality word embeddings, it is not desireable to optimise the word embeddings towards the character-level representations. This can be achieved by making sure that the optimisation is performed only in one direction.
In short, the character-embedding is expected to predict a high-quality word-embedding, but not the other way around.
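
The gated combination and the alignment term from paper [4] can be written out as a small forward computation. This is a sketch under assumptions: the weight matrices are identity matrices chosen only to keep the example readable, the function names are hypothetical, and the word-embedding and the transformed character-embedding are assumed to have the same dimension so that they can be mixed elementwise.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def gated_combination(d, h_hat, W1, W2, W3):
    """z = sigmoid(W3 tanh(W1 d + W2 h_hat)); x = z * d + (1 - z) * h_hat (elementwise)."""
    hidden = [math.tanh(a + b) for a, b in zip(matvec(W1, d), matvec(W2, h_hat))]
    z = [sigmoid(v) for v in matvec(W3, hidden)]
    x = [zj * dj + (1.0 - zj) * hj for zj, dj, hj in zip(z, d, h_hat)]
    return x, z

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def alignment_penalty(h_hats, ds, is_oov):
    """sum over t of g_t * (1 - cos(h_hat_t, d_t)), with g_t = 0 for OOV words."""
    return sum((0.0 if oov else 1.0) * (1.0 - cosine(h, d))
               for h, d, oov in zip(h_hats, ds, is_oov))

identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # placeholder weights
d_t = [1.0, 0.0, -1.0]       # word-embedding
h_hat = [0.5, 0.5, -0.5]     # transformed character-embedding
x_tilde, z = gated_combination(d_t, h_hat, identity, identity, identity)
penalty = alignment_penalty([h_hat, h_hat], [d_t, d_t], [False, True])
```

The sketch only evaluates the extra term. The one-directional optimisation described in the quoted passage would additionally require treating $d_t$ as a constant when this term is differentiated, which a forward computation like this cannot show.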
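2.2.2 CNN networks
RNNs cannot be computed in parallel. And although RNNs can handle long-range dependencies, when the sequence is long, much of the information that the tail of the sequence needs from its head is still lost. Some researchers have therefore modeled the sequence with CNNs instead.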
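Paper [5] models the sequence with a CNN, specifically Iterated Dilated Convolutions (IDCNNs). The inputs are a word-embedding, a character-embedding, and a shape-embedding, joined by concatenation. In detail:
1. A sliding window is used and a convolution is applied within each window. The convolution layer is not followed by a pooling layer; instead, following the idea of dilated convolution, further convolutions are stacked, for $L$ dilated convolutions in total, and this stack is called a Block. Because of the dilation, each added layer enlarges the receptive field of its neurons exponentially, by powers of 2 ($r^{l-1}$, with window width $w = 2r + 1$).
2. A Multi-Scale Context Aggregation mechanism feeds the output of one Block in as the input of the next. Since the previous Block already covers the whole input and so carries global information, this adds non-local information to the word currently being computed.

By feeding the outputs of each dilated convolution as the input to the next, increasingly non-local information is incorporated into each pixel's representation.

3. A modified loss function. In a DNN+CRF architecture, the loss is usually computed by feeding the network output into a CRF layer and computing the probability of the tag sequence for the whole input. The proposed model has several Blocks, and the output of each Block can be fed directly into the CRF layer to compute tag-sequence probabilities. So with $L_b$ Blocks, the output of every Block is sent to the CRF layer, and the resulting $L_b$ predictions are optimized jointly. Besides improving accuracy, this also mitigates the vanishing gradient problem.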
Instead of explicit reasoning over output labels during inference, we train the network such that each block is predictive of output labels. Subsequent blocks learn to correct dependency violations of their predecessors, refining the final sequence prediction.
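There are two reasons for using dilated convolution instead of pooling. First, any pooling loses information, whether that information is useful or not. Second, with dilated convolution the coverage of each node (its receptive field) grows exponentially with the number of layers.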
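
The exponential growth of the receptive field can be checked with a small experiment. The sketch below assumes width-3 filters and dilations that double at each layer (1, 2, 4, 8); both choices, and the function names, are assumptions made for the illustration and are not the configuration of paper [5]. An impulse is pushed through the stack, and the number of positions it reaches equals the number of input positions that a single output position can see.

```python
def dilated_conv1d(seq, kernel, dilation):
    """Width-3 dilated convolution with zero padding; output length equals input length."""
    n = len(seq)
    out = []
    for t in range(n):
        total = 0.0
        for offset, w in zip((-dilation, 0, dilation), kernel):
            j = t + offset
            if 0 <= j < n:
                total += w * seq[j]
        out.append(total)
    return out

def receptive_field(dilations):
    """Input positions visible to one output after stacking width-3 layers."""
    return 1 + 2 * sum(dilations)

dilations = [1, 2, 4, 8]          # assumed: dilation doubles at every layer
signal = [0.0] * 41
signal[20] = 1.0                  # impulse in the middle of the sequence
for d in dilations:
    signal = dilated_conv1d(signal, (1.0, 1.0, 1.0), d)

touched = [i for i, v in enumerate(signal) if v != 0.0]
print(len(touched), receptive_field(dilations))     # 31 31
print(receptive_field([1, 1, 1, 1]))                # 9 for four undilated layers
```

With four layers the dilated stack covers 31 positions, against 9 for four ordinary width-3 convolutions, and no pooling step has discarded any position along the way.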
The paper runs experiments at both the sentence level and the document level; details to be added.
The paper uses dropout with expectation-linear regularization. I have not read the details yet, but it sounds quite fancy.
The version of dropout typically used in practice has the undesirable property that the randomized predictor used at train time differs from the fixed one used at test time. Ma et al. (2017) present dropout with expectation linear regularization, which explicitly regularizes these two predictors to behave similarly.

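2.2.3 RNN (Bi-LSTM) + CNN networks

Paper [2] also takes a word-embedding and a character-embedding as input. Unlike the papers above, it extracts the character-embedding with a CNN. Its network structure is shown in the figure below.
Figure 1: RNN+CNN+CRF network structure
The authors argue that a CNN is an effective way to extract morphological information, such as the prefix or suffix of a word.

CNN is an effective approach to extract morphological information (like the prefix or suffix of a word) from characters of words and encode it into neural representations.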

2.3 Some tricks

2.3.1 Embedding

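Input embeddings are either randomly initialized or taken from pre-trained embeddings such as w2v or glove. Because embeddings can be trained (without supervision) on a much larger dataset, initializing with pre-trained embeddings lets the network converge to a better result. Experiments also show that pre-trained embeddings give a large improvement over randomly initialized ones [1].
The character-embedding is usually initialized randomly and updated during training. It can also be pre-trained on a large corpus with w2v or glove and then fine-tuned during training.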

2.3.2 Initialization

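Randomly initialized embeddings are sampled from the uniform distribution $\left[-\sqrt{\frac{3}{dim}}, \sqrt{\frac{3}{dim}}\right]$, where $dim$ is the embedding dimension. This gives each component a variance of $1/dim$, which corresponds to a LeCun-style uniform initializer and not to the He initializer (whose uniform bound would be $\sqrt{6/dim}$). For the character-level component, He initialization is recommended when a CNN (with ReLU activation) extracts the character-embedding, and Xavier initialization is recommended when an LSTM (with $\sigma$ and tanh activations) does.
Weight Matrix and Bias Vectors. Weight matrices are sampled from the uniform distribution $\left[-\sqrt{\frac{6}{r+c}}, \sqrt{\frac{6}{r+c}}\right]$, where $r$ and $c$ are the numbers of rows and columns of the weight matrix. Bias vectors are initialized to zero, except the bias $b_f$ of the LSTM forget gate, which is initialized to 1.0.
Most applications of LSTMs simply initialize the LSTMs with small random weights which works well on many problems. But this initialization effectively sets the forget gate to 0.5. This introduces a vanishing gradient with a factor of 0.5 per timestep, which can cause problems whenever the long term dependencies are particularly severe. This problem is addressed by simply initializing the forget gates $b_f$ to a large value such as 1 or 2. By doing so, the forget gate will be initialized to a value that is close to 1, enabling gradient flow.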
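
The three initialization rules above can be sketched in a few lines. The function names and the sizes used here are hypothetical, and the code covers only the sampling itself: the $\sqrt{3/dim}$ bound for embeddings, the $\sqrt{6/(r+c)}$ bound for weight matrices, and zero biases with the forget-gate bias set to 1.0.

```python
import math
import random

def init_embedding(vocab_size, dim, rng):
    """Sample every component uniformly from [-sqrt(3/dim), sqrt(3/dim)]."""
    bound = math.sqrt(3.0 / dim)
    return [[rng.uniform(-bound, bound) for _ in range(dim)] for _ in range(vocab_size)]

def init_weight(rows, cols, rng):
    """Sample every entry uniformly from [-sqrt(6/(r+c)), sqrt(6/(r+c))]."""
    bound = math.sqrt(6.0 / (rows + cols))
    return [[rng.uniform(-bound, bound) for _ in range(cols)] for _ in range(rows)]

def init_lstm_biases(hidden_dim):
    """Zero biases for every gate except the forget gate, whose bias b_f is 1.0."""
    return {"input": [0.0] * hidden_dim,
            "forget": [1.0] * hidden_dim,
            "output": [0.0] * hidden_dim,
            "candidate": [0.0] * hidden_dim}

rng = random.Random(0)
embeddings = init_embedding(vocab_size=1000, dim=100, rng=rng)
weights = init_weight(rows=200, cols=100, rng=rng)
biases = init_lstm_biases(hidden_dim=200)

# With zero pre-activation from the weights, the forget gate starts at sigmoid(1.0),
# about 0.73, compared with sigmoid(0.0) = 0.5 for a zero bias.
forget_gate_at_init = 1.0 / (1.0 + math.exp(-1.0))
```

The last line makes the quoted argument concrete: a bias of 1.0 starts the forget gate well above 0.5, so less of the cell state is discarded at each time step early in training.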

2.3.3 OOV

[1]:words that do not have an embedding in the lookup table are mapped to a UNK embedding. To train the UNK embedding, we replace singletons with the UNK embedding with a probability 0.5.
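
The singleton replacement described in [1] can be sketched as a preprocessing step over the training sentences. The token string `<UNK>`, the function names, and the toy sentences are assumptions made for the example; in practice the random replacement would usually be redrawn each time a sentence is visited and not applied once up front.

```python
import random
from collections import Counter

UNK = "<UNK>"   # hypothetical name for the unknown-word token

def replace_singletons(sentences, p=0.5, rng=None):
    """Replace words that occur exactly once in the corpus by UNK with probability p."""
    rng = rng or random.Random()
    counts = Counter(word for sentence in sentences for word in sentence)
    return [[UNK if counts[word] == 1 and rng.random() < p else word
             for word in sentence]
            for sentence in sentences]

def lookup(word, table):
    """Words without an embedding in the lookup table fall back to the UNK embedding."""
    return table.get(word, table[UNK])

train = [["the", "cabinet", "met", "today"],
         ["the", "cabinets", "were", "full"]]
noisy = replace_singletons(train, p=0.5, rng=random.Random(0))

table = {UNK: [0.0, 0.0], "the": [0.1, 0.2]}
vector = lookup("unseen-word", table)   # falls back to the UNK embedding
```

Because some training words are seen as `<UNK>`, the UNK embedding receives gradient updates and is meaningful when a truly unseen word is mapped to it at test time.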

2.3.4 dropout

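Dropout is typically set to 0.5 on hidden layers and 0.8 on input layers. Read as the probability of keeping a unit, 0.8 means that 20% of the inputs are dropped. For hidden-layer neurons, a dropout rate of 0.5 generally works best, because at that rate the random network structures generated by dropout are the most diverse.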
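
A minimal sketch of dropout with these two settings follows. It uses the "inverted" variant, which rescales the surviving units at training time so that nothing has to change at test time; that variant, the function name, and the reading of 0.5 and 0.8 as keep probabilities are assumptions of this example and are not details given in the text.

```python
import random

def dropout(values, keep_prob, rng, training=True):
    """Inverted dropout: keep each unit with probability keep_prob and rescale it."""
    if not training:
        return list(values)       # the test-time predictor uses all units unchanged
    return [v / keep_prob if rng.random() < keep_prob else 0.0 for v in values]

rng = random.Random(0)
inputs = [1.0] * 10000
hidden = [1.0] * 10000

dropped_inputs = dropout(inputs, keep_prob=0.8, rng=rng)   # input layer
dropped_hidden = dropout(hidden, keep_prob=0.5, rng=rng)   # hidden layer

kept_fraction = sum(1 for v in dropped_hidden if v != 0.0) / len(dropped_hidden)
mean_activation = sum(dropped_hidden) / len(dropped_hidden)
```

On this sample about half of the hidden units are zeroed, while the rescaling keeps the mean activation close to its original value of 1.0.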

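2.4 Summary

The CRF layer is added to exploit the local dependencies between tags.
The character level is added to alleviate the OOV problem. The embeddings of unseen words and rare words usually cannot represent those words well for lack of data, and a character-level embedding helps with this. For example, even if "cabinets" does not appear in the dataset, "cabinet" and "-s" do, so "cabinets" can be inferred at the character level. At the same time, frequent words can still obtain high-quality word-embeddings (The main benefits of character-level modeling are expected to come from improved handling of rare and unseen words, whereas frequent words are likely able to learn high-quality word-level embeddings directly. We would like to take advantage of this, and train the character component to predict these word embeddings. [4])
In text classification, some researchers use multi-channel input with one static word-embedding and one dynamic embedding: the static one is kept fixed during training (preserving global features), while the dynamic one is fine-tuned and so becomes task-specific. I do not know how well this would work for NER.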

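3 Chinese named entity recognition

In terms of model structure, Chinese NER does not differ much from English NER. The main point to consider is the use of the character-embedding, because in a Chinese corpus individual characters show no morphological variation. I will run an experiment once I have finished preparing the data.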

Ref

An empirical exploration of recurrent network architectures
