[Paper] New baseline correction algorithm for text-line recognition with bidirectional recurrent neural networks


This paper proposes a baseline pre-processing method for handwritten text recognition, aimed at the problem of slanted and fluctuating baselines. Starting from a rough estimate of baseline key points, the method uses sliding-window detection to normalize the baseline. A BLSTM recognizer is used for the experimental validation.

The proposed pre-processing applies to free-style handwriting: a sliding-window approach normalizes the baseline to a horizontal line, and, unlike machine-learning approaches, it requires no training. The pre-processing is then chained to an RNN recognizer made of two networks, one scanning the line from left to right and the other from right to left, so that both past and future information is taken into account and the recognition is context-aware.

The pre-processing consists of three steps: background cleaning, baseline correction, and slant correction.

background cleaning:
1. Binarize the original image with Otsu thresholding, extract edges with a Canny edge extractor, and detect the main text line with a Hough transform; this line corresponds to the highest peak of the Hough (ρ, θ) accumulator.
2. h-mean and h-max, the average and maximum component heights, are extracted from the connected components that do not reach the upper or lower limits of the image.
3. An initial set S-c of candidate noisy components is composed of the connected components that are distant from the main text line, i.e., whose distance is greater than h-mean/2 computed in Step 2.
4. From the previous set S-c, we build the set S-n ⊂ S-c of noisy components to remove. These are the S-c components in contact with the upper or lower edge of the image, as well as the S-c components whose gravity center is peripheral: the distance of their gravity center to the main text line is greater than h-max computed in Step 2.
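Steps 2-4 can be sketched as follows. This is a hedged illustration, not the paper's code: components are assumed to be given as dictionaries with precomputed bounding boxes and gravity-center y positions (e.g., from a connected-component labelling pass), and the main text line is given by its y coordinate.

```python
def find_noisy_components(components, line_y, img_height):
    """components: list of dicts with 'top', 'bottom', 'cy' (gravity-center y).
    Returns the subset S_n of components to remove."""
    # Step 2: h_mean / h_max over components away from the image borders.
    inner = [c for c in components
             if c["top"] > 0 and c["bottom"] < img_height - 1]
    heights = [c["bottom"] - c["top"] for c in inner]
    h_mean = sum(heights) / len(heights)
    h_max = max(heights)

    # Step 3: candidate set S_c -- components far from the main text line.
    s_c = [c for c in components if abs(c["cy"] - line_y) > h_mean / 2]

    # Step 4: S_n -- S_c components touching the top/bottom edge, or whose
    # gravity center lies farther than h_max from the main line.
    s_n = [c for c in s_c
           if c["top"] <= 0 or c["bottom"] >= img_height - 1
           or abs(c["cy"] - line_y) > h_max]
    return s_n
```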

baseline correction:
1. The vertical projection profile (PP) is created by counting, at each y position, the number of foreground pixels along the horizontal direction.
2. The distribution (histogram) of these projection values is computed. This histogram is expected to have a principal mode corresponding to the text-line core zone, since the core zone is assumed to have a higher pixel density than the zones corresponding to ascenders and descenders.
3. A thresholding algorithm (Otsu's method) is applied to this distribution. The text core zone then corresponds to the largest continuous zone of the PP above the threshold.
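The three steps above can be sketched in pure Python. The binary image is a list of rows of 0/1 values; the 1-D Otsu variant and function names are illustrative assumptions, not the paper's implementation.

```python
def otsu_threshold(values):
    """1-D Otsu: pick the cut that maximizes between-class variance."""
    best_t, best_var = min(values), -1.0
    for t in sorted(set(values)):
        lo = [v for v in values if v <= t]
        hi = [v for v in values if v > t]
        if not lo or not hi:
            continue
        w0, w1 = len(lo), len(hi)
        m0, m1 = sum(lo) / w0, sum(hi) / w1
        var = w0 * w1 * (m0 - m1) ** 2   # between-class variance (unscaled)
        if var > best_var:
            best_var, best_t = var, t
    return best_t

def core_zone(binary_image):
    pp = [sum(row) for row in binary_image]   # step 1: projection profile
    t = otsu_threshold(pp)                    # steps 2-3: threshold its values
    # Longest run of consecutive rows with pp > t = the text core zone.
    best, cur, start = (0, 0), 0, 0
    for y, v in enumerate(pp + [0]):
        if v > t:
            if cur == 0:
                start = y
            cur += 1
            if cur > best[1] - best[0]:
                best = (start, start + cur)
        else:
            cur = 0
    return best  # (top_y, bottom_y) of the core zone
```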

slant correction:
We use Vinciarelli's approach, which relies on histograms of projection values computed along several candidate slant angles.
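A simplified sketch in the spirit of that estimator: shear the binary image by each candidate angle and score how "vertical" the strokes have become, counting only columns whose foreground pixels form one contiguous run. The shear sign convention, angle grid, and scoring details here are assumptions, not the paper's exact procedure.

```python
import math

def sheared(image, angle_deg):
    """Horizontally shear a binary image (list of 0/1 rows) by angle_deg."""
    h, w = len(image), len(image[0])
    tan_a = math.tan(math.radians(angle_deg))
    offsets = [int(round((h - 1 - y) * tan_a)) for y in range(h)]
    lo = min(offsets)
    out = [[0] * (w + max(offsets) - lo) for _ in range(h)]
    for y, row_ in enumerate(image):
        for x, v in enumerate(row_):
            if v:
                out[y][x + offsets[y] - lo] = 1
    return out

def slant_score(image):
    """Sum of squared column heights over columns whose foreground pixels
    form a single contiguous vertical run (Vinciarelli-style measure)."""
    score = 0
    for x in range(len(image[0])):
        col = [row_[x] for row_ in image]
        n = sum(col)
        if n and col.index(1) + n - 1 == len(col) - 1 - col[::-1].index(1):
            score += n * n
    return score

def estimate_slant(image, angles=range(-45, 46, 5)):
    # Best deslanting shear = the angle that maximizes the score.
    return max(angles, key=lambda a: slant_score(sheared(image, a)))
```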

A 28-dimensional feature vector is extracted per sliding window:
F1: foreground pixel density
F2: number of foreground/background transitions between adjacent cells
F3: gravity center position difference with following window
F4: relative position of gravity center
F5: pixel density above upper baseline (ascenders zone)
F6: pixel density under lower baseline (descenders zone)
F7: number of foreground/background transitions between cells above lower baseline (ascenders + core zone)
F8: relative position of gravity center wrt baselines
F9 to F20: local convexity features
F21 to F28: pixel density for each frame column (considering a window size w = 8)
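A few of these per-window features (F1, F2, F4) can be sketched as below; the window is a list of 0/1 rows, and the exact cell layout and normalizations of the paper are not reproduced, so treat this as an illustration only.

```python
def window_features(window):
    """window: list of rows of 0/1. Returns (F1, F2, F4)-style features."""
    h, w = len(window), len(window[0])
    fg = sum(sum(row) for row in window)
    f1 = fg / (h * w)                      # F1: foreground pixel density
    f2 = 0                                 # F2: fg/bg transitions, per column
    for x in range(w):
        col = [row[x] for row in window]
        f2 += sum(1 for a, b in zip(col, col[1:]) if a != b)
    if fg:                                 # gravity-center y (mean fg row)
        cy = sum(y for y, row in enumerate(window) for v in row if v) / fg
    else:
        cy = (h - 1) / 2
    f4 = cy / (h - 1)                      # F4: relative gravity-center position
    return f1, f2, f4
```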

BLSTM recognition:
A recurrent network consists of an input layer, an output layer, and a hidden layer. At each time step t, the input layer consists of the frame extracted at t. The output layer consists of 91 units, each unit being associated with a given character (A to Z, a to z, numerals, symbols). The two networks, forward and backward, share the same input and output layers; they differ only in their hidden layers.
However, in practice the error signal decays strongly during gradient-based learning of a standard RNN (the vanishing-gradient problem). The hidden units are therefore replaced by blocks called LSTM, each containing a memory cell that can either keep information over long time intervals (more than 1000 time steps) or be reset in an instant.
The input gate is placed before the cell unit (at the center of the block in Fig. 11). Its main purpose is to protect the cell's memory from irrelevant inputs, so that the cell can keep relevant information through time via its recurrent loop.
The output gate is placed after the cell unit. Similarly, it protects the output layer from the cell's current content while that content is not yet relevant for outputting.
The forget gate is a more recent improvement designed to reset the cell's value and avoid its endless growth; it can erase the cell's memory once its content is irrelevant to the present inputs.
Peepholes are connections between the cell and the different gates which allow those gates to inspect the current cell state.
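The gate roles above can be made concrete with a minimal single-block LSTM step with peephole connections. Scalar weights are used for readability (a real layer vectorizes this over 100 blocks), and the weight names are illustrative, not taken from the paper.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One time step of a single LSTM block. p: dict of scalar weights."""
    # Input gate: decides whether the candidate input may enter the cell;
    # its peephole p["pi"] lets it inspect the current cell state.
    i = sigmoid(p["wi"] * x + p["ui"] * h_prev + p["pi"] * c_prev + p["bi"])
    # Forget gate: can reset the cell and stop its value growing endlessly.
    f = sigmoid(p["wf"] * x + p["uf"] * h_prev + p["pf"] * c_prev + p["bf"])
    # Candidate cell input.
    g = math.tanh(p["wg"] * x + p["ug"] * h_prev + p["bg"])
    c = f * c_prev + i * g
    # Output gate: shields the output from not-yet-relevant cell content;
    # its peephole p["po"] looks at the *updated* cell state.
    o = sigmoid(p["wo"] * x + p["uo"] * h_prev + p["po"] * c + p["bo"])
    h = o * math.tanh(c)
    return h, c
```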
We use a BLSTM with one hidden layer containing 100 blocks. The BLSTM recognizer is trained with a gradient-based method, "Back-Propagation Through Time." After each training epoch, the recognition error rate is evaluated on a validation set; if the error rate does not decrease for twenty epochs, training is stopped. This early-stopping strategy avoids over-fitting.
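The stopping rule amounts to the following loop; `train_epoch` and `validation_error` are placeholder callables, not the paper's code.

```python
def train_with_early_stopping(train_epoch, validation_error,
                              patience=20, max_epochs=10000):
    """Stop once validation error has not improved for `patience` epochs."""
    best_err, best_epoch = float("inf"), 0
    for epoch in range(max_epochs):
        train_epoch()
        err = validation_error()
        if err < best_err:
            best_err, best_epoch = err, epoch
        elif epoch - best_epoch >= patience:
            break                  # no improvement for `patience` epochs
    return best_err
```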
