Literature Reading on Text Recognition


I. Literature Reading

Scene Text Detection

[2017-AAAI] TextBoxes: A Fast Text Detector with a Single Deep Neural Network

  1. Main content:
    This paper presents an end-to-end trainable fast scene text detector, named TextBoxes, which detects scene text with both high accuracy and efficiency in a single network forward pass, involving no post-process except for a standard nonmaximum suppression. TextBoxes outperforms competing methods in terms of text localization accuracy and is much faster, taking only 0.09s per image in a fast implementation. Furthermore, combined with a text recognizer, TextBoxes significantly outperforms state-of-the-art approaches on word spotting and end-to-end text recognition tasks.

  2. Main contributions:
    To summarize, the contributions of this paper are threefold: First, we design an end-to-end trainable neural network model for scene text detection. Second, we propose a word spotting/end-to-end recognition framework that effectively combines detection and recognition. Third, our model achieves highly competitive results while keeping its computational efficiency.

  3. Several categories of text detection approaches:
    A. Character-based
    B. Word-based
    C. Text-line-based

  4. Architecture:

TextBoxes Architecture. TextBoxes is a 28-layer fully convolutional network. Among them, 13 are inherited from VGG-16. 9 extra convolutional layers are appended after the VGG-16 layers. Text-box layers are connected to 6 of the convolutional layers. On every map location, a text-box layer predicts a 72-d vector, which are the text presence scores (2-d) and offsets (4-d) for 12 default boxes. A non-maximum suppression is applied to the aggregated outputs of all text-box layers.
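As a rough illustration of that 72-channel prediction (12 default boxes × (2 text/non-text scores + 4 box offsets)), here is a minimal text-box layer sketch in PyTorch; the framework choice, the 512-channel input and all names are illustrative assumptions, not the authors' released code:

```python
import torch
import torch.nn as nn

class TextBoxLayer(nn.Module):
    """Predicts, at every feature-map location, scores and offsets
    for a fixed set of default boxes (12 in TextBoxes)."""
    def __init__(self, in_channels, num_default_boxes=12):
        super().__init__()
        # 2 text/non-text scores + 4 offsets per default box = 72 output channels.
        out_channels = num_default_boxes * (2 + 4)
        # TextBoxes uses "long" 1x5 kernels to better match horizontal words.
        self.pred = nn.Conv2d(in_channels, out_channels,
                              kernel_size=(1, 5), padding=(0, 2))

    def forward(self, feature_map):
        # (N, 72, H, W); outputs from all six text-box layers are aggregated
        # and passed through non-maximum suppression.
        return self.pred(feature_map)

# Example: a text-box layer attached to a 512-channel VGG-16 feature map.
layer = TextBoxLayer(in_channels=512)
print(layer(torch.randn(1, 512, 38, 38)).shape)  # torch.Size([1, 72, 38, 38])
```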

  5. Experimental comparison:


Text localization on ICDAR 2011 and ICDAR 2013. P, R and F refer to precision, recall and F-measure respectively. FCRNall+filts reported a time consumption of 1.27 seconds excluding its regression step so we assume it takes more than 1.27 seconds.


Examples of text localization results. The green bounding boxes are correct detections; Red boxes are false positives; Red dashed boxes are false negatives.


Word spotting and end-to-end results. The values in the table are F-measure. For ICDAR 2013, strong, weak and generic mean a small lexicon containing 100 words for each image, a lexicon containing all words in the whole test set and a large lexicon respectively. We use a lexicon containing 90k words as our generic lexicon. The methods marked by “*” are published on the ICDAR 2015 Robust Reading Competition website http://rrc.cvc.uab.es.

  6. Limitations

TextBoxes performs well in most situations. However, it still fails to handle some difficult cases, such as overexposure and large character spacing.

  7. Conclusion

We have presented TextBoxes, an end-to-end fully convolutional network for text detection, which is highly stable and efficient to generate word proposals against cluttered backgrounds. Comprehensive evaluations and comparisons on benchmark datasets clearly validate the advantages of TextBoxes in three related tasks including text detection, word spotting and end-to-end recognition. In the future, we are interested in extending TextBoxes for multi-oriented texts, and in combining the networks of detection and recognition into one unified framework.

Scene Text Recognition

[2015-CoRR] An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition

1. Main content

(1)It is end-to-end trainable, in contrast to most of the existing algorithms whose components are separately trained and tuned. (2) It naturally handles sequences in arbitrary lengths, involving no character segmentation or horizontal scale normalization. (3) It is not confined to any predefined lexicon and achieves remarkable performances in both lexicon-free and lexicon-based scene text recognition tasks. (4) It generates an effective yet much
smaller model, which is more practical for real-world application scenarios.

2. Main contributions

The main contribution of this paper is a novel neural network model, whose network architecture is specifically designed for recognizing sequence-like objects in images. The proposed neural network model is named as Convolutional Recurrent Neural Network (CRNN), since it is a
combination of DCNN and RNN. For sequence-like objects, CRNN possesses several distinctive advantages over conventional neural network models: 1) It can be directly learned from sequence labels (for instance, words), requiring no detailed annotations (for instance, characters); 2) It has the same property of DCNN on learning informative representations directly from image data, requiring neither hand-craft features nor preprocessing steps, including binarization/segmentation, component localization, etc.; 3) It has the same property of RNN, being able to produce a sequence of labels; 4) It is unconstrained to the lengths of sequence-like objects, requiring only height normalization in both training and testing phases; 5) It achieves better or highly competitive performance on scene texts (word recognition) than the prior arts [23, 8]; 6) It contains much less parameters than a standard DCNN model, consuming less
storage space.

3. Architecture

The network architecture. The architecture consists of three parts: 1) convolutional layers, which extract a feature sequence from the input image; 2) recurrent layers, which predict a label distribution for each frame; 3) transcription layer, which translates the per-frame predictions into the final label sequence.
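A minimal PyTorch sketch of this three-part pipeline is given below; the convolutional backbone is heavily simplified and all sizes are illustrative, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class CRNNSketch(nn.Module):
    """Conv layers -> feature sequence -> recurrent layers -> per-frame logits."""
    def __init__(self, num_classes):
        super().__init__()
        # 1) Convolutional layers: extract a feature sequence from the image.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
            nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, None)),  # collapse height, keep width
        )
        # 2) Recurrent layers: predict a label distribution for each frame.
        self.rnn = nn.LSTM(256, 256, num_layers=2, bidirectional=True)
        self.fc = nn.Linear(512, num_classes)  # includes the CTC "blank" class

    def forward(self, images):
        feats = self.cnn(images)                    # (N, C, 1, W')
        frames = feats.squeeze(2).permute(2, 0, 1)  # (W', N, C): one frame per column
        frames, _ = self.rnn(frames)                # (W', N, 512)
        return self.fc(frames)                      # per-frame class logits

# 3) Transcription: a CTC loss (e.g. nn.CTCLoss) collapses the per-frame
# predictions into the final label sequence during training.
model = CRNNSketch(num_classes=37)  # e.g. 26 letters + 10 digits + blank
print(model(torch.randn(4, 1, 32, 100)).shape)  # (seq_len, batch, num_classes)
```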

4. LSTM
(a) The structure of a basic LSTM unit. An LSTM consists of a cell module and three gates, namely the input gate, the output gate and the forget gate. (b) The structure of the deep bidirectional LSTM we use in our paper. Combining a forward (left to right) and a backward (right to left) LSTM results in a bidirectional LSTM. Stacking multiple bidirectional LSTMs results in a deep bidirectional LSTM.
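Assuming PyTorch, the deep bidirectional LSTM in (b) can be expressed directly; a tiny sketch with made-up sizes:

```python
import torch
import torch.nn as nn

# Each layer runs a forward and a backward LSTM over the frame sequence and
# concatenates their outputs; stacking two such layers gives a deep BiLSTM.
deep_bilstm = nn.LSTM(input_size=256, hidden_size=256,
                      num_layers=2, bidirectional=True)

frames = torch.randn(25, 4, 256)   # (sequence length, batch, features)
outputs, _ = deep_bilstm(frames)
print(outputs.shape)               # torch.Size([25, 4, 512])
```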

5. Network configuration


Network configuration summary. The first row is the top layer. ‘k’, ‘s’ and ‘p’ stand for kernel size, stride and padding size respectively.

6. Results comparison


Recognition accuracies (%) on four datasets. In the second row, “50”, “1k”, “50k” and “Full” denote the lexicon used, and “None” denotes recognition without a lexicon. (*[22] is not lexicon-free in the strict sense, as its outputs are constrained to a 90k dictionary.)

7. Conclusion

The experiments on the scene text recognition benchmarks demonstrate that CRNN achieves superior or highly competitive performance, compared with conventional methods as well as other CNN and RNN based algorithms. This confirms the advantages of the proposed algorithm. In addition, CRNN significantly outperforms other competitors on a benchmark for Optical Music Recognition (OMR), which verifies the generality of CRNN. Actually, CRNN is a general framework, thus it can be applied to other domains and problems (such as Chinese character recognition), which involve sequence prediction in images. To further speed up CRNN and make it more practical in real-world applications is another direction that is worthy of exploration in the future.

II. RNN & LSTM
Recurrent Neural Networks (RNNs)
Recurrent Neural Networks (RNNs) have achieved great success and are widely used in natural language processing (NLP). Unlike traditional feed-forward neural networks (FNNs), RNNs introduce directed cycles, which lets them handle problems where successive inputs depend on one another.

What is an RNN?
RNNs are designed to process sequential data. In a traditional neural network, data flows from the input layer to the hidden layer and then to the output layer; adjacent layers are fully connected, while nodes within a layer are not connected to each other. Such a plain network is helpless for many problems. For example, predicting the next word of a sentence usually requires the preceding words, because the words in a sentence are not independent of one another. RNNs are called recurrent because the current output of a sequence also depends on earlier outputs. Concretely, the network memorizes earlier information and uses it when computing the current output: the hidden-layer nodes are now connected across time steps, and the input to the hidden layer includes not only the output of the input layer but also the hidden layer's output from the previous time step. In theory RNNs can process sequences of arbitrary length; in practice, to keep the complexity down, the current state is often assumed to depend on only a few of the preceding states. A typical RNN recurrence is sketched below.
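The recurrence just described, where the hidden layer sees both the current input and the previous hidden state through shared parameters W, U and V, fits in a few lines; a minimal sketch with made-up dimensions (PyTorch is used here only for its tensor operations):

```python
import torch

# Shared parameters, reused at every time step (illustrative sizes).
input_size, hidden_size, output_size = 10, 20, 5
W = 0.1 * torch.randn(hidden_size, input_size)   # input -> hidden
U = 0.1 * torch.randn(hidden_size, hidden_size)  # previous hidden -> hidden
V = 0.1 * torch.randn(output_size, hidden_size)  # hidden -> output

def rnn_step(x_t, h_prev):
    """One time step: the new hidden state mixes the current input with the
    previous hidden state; the output is read off the hidden state."""
    h_t = torch.tanh(W @ x_t + U @ h_prev)
    return h_t, V @ h_t

h = torch.zeros(hidden_size)
for x_t in torch.randn(7, input_size):   # a length-7 input sequence
    h, o = rnn_step(x_t, h)
```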
What can RNNs do?
RNNs have proven very successful for NLP in practice, for example in word-vector representations, checking the grammaticality of sentences, and part-of-speech tagging. Among RNNs, the most widely used and most successful model at present is the LSTM (Long Short-Term Memory), which usually captures long- and short-range dependencies better than vanilla RNNs; compared with an ordinary RNN, it only changes the internals of the hidden layer.
How are RNNs trained?
Training an RNN is essentially the same as training a traditional ANN: the backpropagation (BP) algorithm is used as well, but with one difference. When an RNN is unrolled, the parameters W, U and V are shared across all time steps, whereas a traditional network does not share parameters. Moreover, under gradient descent the output at each step depends not only on the network at the current step but also on the network states of several previous steps. For example, at t = 4 we still have to propagate the error back through three more steps, and each of those steps contributes its own gradient to the shared parameters. This learning algorithm is called Backpropagation Through Time (BPTT).
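A toy autograd sketch of BPTT (made-up sizes and loss): the shared parameters appear at every unrolled step, so a single backward() call sums each step's contribution into the same gradients.

```python
import torch

# Shared parameters, reused at every unrolled time step (illustrative sizes).
W = (0.1 * torch.randn(20, 10)).requires_grad_()   # input  -> hidden
U = (0.1 * torch.randn(20, 20)).requires_grad_()   # hidden -> hidden
V = (0.1 * torch.randn(5, 20)).requires_grad_()    # hidden -> output

xs = torch.randn(4, 10)                  # a length-4 input sequence (t = 1..4)
h, outputs = torch.zeros(20), []
for x_t in xs:                           # unroll the network through time
    h = torch.tanh(W @ x_t + U @ h)
    outputs.append(V @ h)

loss = torch.stack(outputs).pow(2).mean()  # toy loss over all time steps
loss.backward()                 # BPTT: every step adds to the shared gradients
print(W.grad.shape, U.grad.shape, V.grad.shape)
```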
LSTM
Long Short-Term Memory networks, usually just called LSTMs, are a special kind of RNN that can learn long-term dependencies. LSTMs were introduced by Hochreiter & Schmidhuber (1997) and have since been refined and popularized by Alex Graves. They have achieved tremendous success on many problems and are now widely used. LSTMs are deliberately designed to avoid the long-term dependency problem: remembering information over long periods is practically their default behavior rather than something they struggle to acquire. (To be expanded.)
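For reference, the gating just described can be written out as a single LSTM step; a minimal sketch in PyTorch (the standard formulation, with illustrative names and sizes):

```python
import torch
import torch.nn as nn

class LSTMCellSketch(nn.Module):
    """One LSTM step: input, forget and output gates decide what enters,
    stays in, and leaves the cell state that carries long-range information."""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.ih = nn.Linear(input_size, 4 * hidden_size)
        self.hh = nn.Linear(hidden_size, 4 * hidden_size)

    def forward(self, x_t, h_prev, c_prev):
        gates = self.ih(x_t) + self.hh(h_prev)
        i, f, g, o = gates.chunk(4, dim=-1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)  # three gates
        g = torch.tanh(g)                  # candidate cell content
        c_t = f * c_prev + i * g           # forget old state, write new content
        h_t = o * torch.tanh(c_t)          # expose part of the cell as output
        return h_t, c_t

cell = LSTMCellSketch(10, 20)
h, c = torch.zeros(1, 20), torch.zeros(1, 20)
h, c = cell(torch.randn(1, 10), h, c)
```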
III. GoogLeNet
Background
With the rapid development of neural networks and deep learning, it has become easy to obtain better predictions by relying on high-performance hardware, huge labeled training sets, and deeper and wider network models, but this also brings serious drawbacks.
One drawback is that deeper and wider models contain an enormous number of parameters and therefore overfit easily. The other is that enlarging the network greatly increases the amount of computation and consumes more computing resources. The fundamental way to address both problems is to turn fully connected, and even ordinary convolutional, connections into sparse connections. The Google team proposed the Inception structure to achieve this goal.
The Inception structure
GoogLeNet introduces the concept of the Inception module, which aims to strengthen the basic feature-extraction block. Ordinary architectures simply keep stacking convolutional layers to add depth, yet each layer uses only a single kernel size; VGG, for instance, uses only 3x3 kernels per layer, so the feature-extraction power of a single layer may be relatively weak. GoogLeNet instead widens a single convolutional stage by applying kernels of different scales in parallel, and builds the Inception module as its basic unit: a basic Inception module contains a 1x1 convolution, a 3x3 convolution, a 5x5 convolution and a 3x3 pooling branch, as sketched below.
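A minimal PyTorch sketch of such an Inception module (channel counts are illustrative; the real GoogLeNet also inserts 1x1 "bottleneck" convolutions before the 3x3 and 5x5 branches to reduce computation):

```python
import torch
import torch.nn as nn

class InceptionSketch(nn.Module):
    """Applies 1x1, 3x3 and 5x5 convolutions plus a 3x3 pooling branch to the
    same input in parallel and concatenates the results along the channel axis."""
    def __init__(self, in_ch, c1, c3, c5, cp):
        super().__init__()
        self.branch1 = nn.Conv2d(in_ch, c1, kernel_size=1)
        self.branch3 = nn.Conv2d(in_ch, c3, kernel_size=3, padding=1)
        self.branch5 = nn.Conv2d(in_ch, c5, kernel_size=5, padding=2)
        self.branch_pool = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_ch, cp, kernel_size=1),
        )

    def forward(self, x):
        # Multi-scale features extracted from the same layer, stacked by channel.
        return torch.cat([self.branch1(x), self.branch3(x),
                          self.branch5(x), self.branch_pool(x)], dim=1)

block = InceptionSketch(192, 64, 128, 32, 32)
print(block(torch.randn(1, 192, 28, 28)).shape)  # torch.Size([1, 256, 28, 28])
```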

The GoogLeNet model


GoogLeNet's training procedure is also distinctive. The network is in fact very deep, and if the gradient had to travel from the last layer all the way back to the first, it would have essentially vanished. GoogLeNet therefore inserts extra softmax classifiers in the middle of the network; these layers provide additional training losses, the corresponding gradients are computed from those losses and added to the gradients of the whole network during backpropagation, which effectively alleviates the vanishing-gradient problem; see the sketch below.
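A minimal sketch of how such auxiliary losses can be combined during training; the 0.3 weight follows the GoogLeNet paper, while the function name and the random logits are made up for illustration (the auxiliary heads are discarded at inference time):

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()

def googlenet_style_loss(main_logits, aux_logits_list, targets, aux_weight=0.3):
    """Total loss = main classifier loss + 0.3 * each auxiliary classifier loss.
    The auxiliary softmax heads inject extra gradient into the middle of the
    network, easing the vanishing-gradient problem in very deep models."""
    loss = criterion(main_logits, targets)
    for aux_logits in aux_logits_list:
        loss = loss + aux_weight * criterion(aux_logits, targets)
    return loss

# Toy usage with random logits from the main head and two auxiliary heads.
targets = torch.randint(0, 1000, (8,))
main = torch.randn(8, 1000)
aux1, aux2 = torch.randn(8, 1000), torch.randn(8, 1000)
print(googlenet_style_loss(main, [aux1, aux2], targets))
```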
