Whole is Greater than Sum of Parts:
Recognizing Scene Text Words

 

II. WORD REPRESENTATION AND MATCHING
We propose a novel method to recognize the word contained in an image as a whole. We extract features from the image and match them with those computed for each word in the lexicon. To this end, we present a gradient-based feature set and a weighted Dynamic Time Warping (DTW) scheme in the remainder of this section.
Gradient-based features. Some of the previous approaches binarize a word image into character vs. non-character regions before computing features [9]. While such a pre-processing step can be effective in reducing the dimensionality of the feature space, it comes with disadvantages. Binarization results are seldom perfect, often contain noise, and binarization itself remains an unsolved problem for scene text images. We therefore look for effective features that do not rely on binarized images. Inspired by the success of Histogram of Oriented Gradient (HOG) features [7] in many vision tasks, we adapt them to the word recognition problem.
To compute the adapted HOG features, we begin by applying the Canny edge operator to the image. Note that we do not expect a clean edge map from this step. We then compute the gradient orientation at each edge pixel. The gradient orientations are accumulated into histograms over overlapping vertical strips extracted from the image, with each contribution weighted by the gradient magnitude. An illustration of the feature computation process is shown in Fig. 3. At the end of this step, we have a representation of the image as a sequence of histograms. In the experimental section we show that these easy-to-compute features are robust for the word matching problem.
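As an illustrative sketch, the following Python/OpenCV code computes magnitude-weighted orientation histograms over overlapping vertical strips of Canny edge pixels. The strip width (4 pixels), shift (2 pixels) and 9 bins follow the settings reported in Section III; the Canny thresholds and the per-strip normalization are assumptions of this sketch, not values stated in the text.

```python
import cv2
import numpy as np

def strip_hog_features(gray, strip_width=4, shift=2, num_bins=9):
    """Magnitude-weighted orientation histograms over overlapping vertical strips."""
    edges = cv2.Canny(gray, 50, 150)                      # edge map (need not be clean); thresholds are placeholders
    gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0, ksize=3)       # horizontal gradient
    gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1, ksize=3)       # vertical gradient
    mag = np.sqrt(gx ** 2 + gy ** 2)
    ori = np.mod(np.arctan2(gy, gx), 2 * np.pi)           # signed orientation in [0, 2*pi)

    feats = []
    for x0 in range(0, gray.shape[1] - strip_width + 1, shift):
        strip = slice(x0, x0 + strip_width)
        mask = edges[:, strip] > 0                        # accumulate only at edge pixels
        hist, _ = np.histogram(ori[:, strip][mask],
                               bins=num_bins, range=(0, 2 * np.pi),
                               weights=mag[:, strip][mask])
        norm = hist.sum()
        feats.append(hist / norm if norm > 0 else hist)   # per-strip normalization (assumption)
    return np.array(feats)                                # one histogram per vertical strip
```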
Matching words. Once words are represented using a set of features, we need a mechanism to match them. The problem is how to match the scene text image and the synthetic lexicon-based images. We formulate the problem of matching scene text and synthetic words in a maximum likelihood framework.
Let X = {x_1, x_2, . . . , x_m} and Y = {y_1, y_2, . . . , y_n} be the feature sequences from a given word and its candidate match respectively. Each vector x_i and y_j is a histogram of gradient features extracted from a vertical strip. Let ω = {ω_1, ω_2, . . . , ω_K} represent the set of word classes, where K is the total number of lexicon words. Since we assume the features at each vertical strip are independent, the joint probability that the feature sequences X and Y originate from the same word ω_k, i.e. P(X, Y | ω_k), can be written as the product of the joint probabilities of features originating from the same strip, i.e.,

\[ P(X, Y \mid \omega_k) = \prod_i P(x_i, y_i \mid \omega_k). \tag{1} \]
In a maximum likelihood framework, the problem of finding an optimal feature sequence Y for a given feature sequence X is equivalent to maximizing ∏_i P(x_i, y_i | ω_k) over all possible Y. This can be written as the minimization of an objective function f, i.e., min_Y Σ_i f(x_i, y_i | ω_k), where f is the weighted squared l2-distance between the feature vectors, i.e., f(x_i, y_j) = (x_i − y_j)^T w_i (x_i − y_j). Here w_i is the weight associated with feature x_i. These weights are learned from the synthetic images, and are proportional to the discriminativeness of the features. In other words, given a feature sequence X and a set of candidate sequences Y, the problem of finding the optimal matching sequence becomes that of minimizing f over all candidate sequences Y. This leads to a sequence alignment problem, for which we propose a weighted dynamic programming based solution. Dynamic Time Warping [17] is used to compute a distance between two time series. The weighted DTW distance DTW(m, n) between the sequences X and Y can be computed recursively using dynamic programming as:
\[ \mathrm{DTW}(i, j) = \min \begin{cases} \mathrm{DTW}(i-1, j) + D(i, j) \\ \mathrm{DTW}(i, j-1) + D(i, j) \\ \mathrm{DTW}(i-1, j-1) + D(i, j), \end{cases} \tag{2} \]
where D(i, j) is the distance between features x_i and y_j, and the local distance matrix D is written as D = (X − Y)^T W (X − Y). The diagonal matrix W is learnt from the synthetic images. For this, we cluster all the feature vectors computed over vertical strips of the synthetic word images into clusters, and compute the entropy of each cluster as:
\[ H(\mathrm{cluster}_p) = - \sum_{k=1}^{K} \Pr(y_j \in \omega_k,\ y_j \in \mathrm{cluster}_p) \times \log_K \Pr(y_j \in \omega_k,\ y_j \in \mathrm{cluster}_p), \tag{3} \]
where Pr is the joint probability of feature y_j originating from class ω_k and falling in cluster_p. A high entropy of a cluster indicates that the features corresponding to that cluster are almost equally distributed across all the word classes. In other words, such features are less informative, and thus are assigned a low weight during matching. The weight w_j associated with a feature vector y_j is computed as: w_j = 1 − H(cluster_p), if y_j ∈ cluster_p.
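A minimal sketch of this weight learning step is given below (Python/NumPy with scikit-learn). It clusters the per-strip feature vectors of the synthetic words, estimates the joint distribution over (word class, cluster), computes the base-K entropy of each cluster as in Eq. (3), and assigns each cluster the weight 1 − H. The use of k-means, the function names, and the default of 30 clusters (the value reported in Section III) are assumptions of this sketch.

```python
import numpy as np
from sklearn.cluster import KMeans

def learn_strip_weights(features, word_labels, num_words, num_clusters=30):
    """features: (N, d) per-strip histograms from synthetic words.
       word_labels: (N,) index of the lexicon word each strip came from."""
    kmeans = KMeans(n_clusters=num_clusters, n_init=10).fit(features)
    cluster_ids = kmeans.labels_

    # Joint distribution Pr(word class, cluster), estimated by counting.
    joint = np.zeros((num_words, num_clusters))
    for w, c in zip(word_labels, cluster_ids):
        joint[w, c] += 1
    joint /= joint.sum()

    # Per-cluster entropy with log base K, following Eq. (3).
    eps = 1e-12
    entropy = -np.sum(joint * np.log(joint + eps) / np.log(num_words), axis=0)

    cluster_weights = 1.0 - entropy          # informative (low-entropy) clusters get high weight
    return kmeans, cluster_weights           # weight of a feature y: cluster_weights[kmeans.predict([y])[0]]
```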
Warping path deviation based penalty. To penalize warping paths that deviate from the near-diagonal path, we multiply the matching cost by a penalty function log10(wp − wpo), where wp and wpo denote the warping path obtained from the DTW matching and the diagonal warping path respectively. This penalizes warping paths where a small portion of one word is matched with a large portion of another word.
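The following sketch (Python/NumPy) illustrates the weighted DTW recursion of Eq. (2) with a cluster-derived weight on each local distance. Applying the weight to the lexicon-side feature and omitting the path-deviation penalty are simplifications of this sketch, not details confirmed by the text.

```python
import numpy as np

def weighted_dtw(X, Y, weight_of):
    """X: (m, d), Y: (n, d) per-strip feature sequences.
       weight_of(y): scalar weight of feature vector y, e.g. 1 - H(cluster of y)."""
    m, n = len(X), len(Y)
    dtw = np.full((m + 1, n + 1), np.inf)
    dtw[0, 0] = 0.0

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            diff = X[i - 1] - Y[j - 1]
            d = weight_of(Y[j - 1]) * float(diff @ diff)    # weighted squared l2-distance D(i, j)
            dtw[i, j] = d + min(dtw[i - 1, j],               # vertical step
                                dtw[i, j - 1],               # horizontal step
                                dtw[i - 1, j - 1])           # diagonal step (Eq. 2)
    return dtw[m, n]
```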
Dynamic k-NN. Given a scene text image and a ranked list of matched synthetic words (each corresponding to one of the lexicon words), our goal is to find the text label. To do so, we apply the k-nearest neighbor rule. One of the issues with a nearest neighbor approach is finding a good k; this parameter is often set manually. To avoid this, we use dynamic k-NN. We start with an initial value of k and measure the randomness of the top k retrievals. Randomness is maximum when all the top k retrievals are different words, and is minimum (i.e. zero) when all the top k retrievals are the same. We increment k by 1 until this randomness decreases. At this point we assign the label of the most frequently occurring synthetic word to the given scene text image.
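A minimal Python sketch of dynamic k-NN is shown below. The concrete randomness measure (number of distinct labels among the top k, normalized to [0, 1]) is an assumption consistent with the description above, not the paper's exact definition.

```python
from collections import Counter

def dynamic_knn(ranked_labels, k_init=3):
    """ranked_labels: lexicon labels of the retrieved synthetic words, best match first."""
    def randomness(k):
        # 0 when all top-k labels agree, 1 when they are all different (assumed measure).
        return (len(set(ranked_labels[:k])) - 1) / (k - 1) if k > 1 else 0.0

    k = min(k_init, len(ranked_labels))
    prev = randomness(k)
    while k < len(ranked_labels):
        k += 1
        cur = randomness(k)
        if cur < prev:          # stop once the randomness of the top retrievals decreases
            break
        prev = cur

    # Assign the most frequent label among the top-k retrievals.
    return Counter(ranked_labels[:k]).most_common(1)[0][0]
```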
In summary, given a scene text word and a set of lexicon words, we transform each lexicon word into a collection of synthetic images, and then represent each image as a sequence of features. We then pose the problem of finding candidate optimal matches for a scene text image in a maximum likelihood framework and solve it using weighted DTW. The weighted DTW scheme provides a set of candidate optimal matches. We then use dynamic k-NN to find the optimal word for a given scene text image.
III. EXPERIMENTS AND RESULTS
In this section we present the implementation details of our approach and its detailed evaluation, and compare it with the best performing methods for this task, namely [12], [14], [18], [19].
A. Datasets
For the experimental analysis we used two datasets, namely Street View Text (SVT) [1] and ICDAR 2003 robust word recognition [2]. The SVT dataset contains images taken from Google Street View. We used the SVT-word dataset, which contains 647 word images relevant for the recognition task. A lexicon of 50 words is also provided with each image. The lexicon for the ICDAR dataset was obtained from [18]. Following the protocol of [18], we ignore words with fewer than two characters or with non-alphanumeric characters, which results in 863 words overall. Note that we could not use the ICDAR

 

B. Implementation Details
Synthetic Word Generation. For every lexicon word we generated synthetic words with 20 different styles and fonts using ImageMagick. We chose some of the most commonly occurring fonts, such as Arial, Times, and Georgia. Our observations suggest that font selection is not a very crucial step for the overall performance of our method. A five-pixel padding was applied to all the images. We noted that all the lexicon words were in uppercase, while the scene text may contain lowercase letters. To account for these variations, we also generated word images where (i) only the first character is in upper case; and (ii) all characters are in lower case. This results in 3 × lexicon size × 20 images in the synthetic database. For the SVT dataset, the synthetic dataset contains around 3000 images.
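As an illustration, a minimal generation script might look like the following (Python calling the ImageMagick convert tool). The font list, point size, and output naming here are placeholders, not the exact settings used in the paper.

```python
import os
import subprocess

FONTS = ["Arial", "Times-New-Roman", "Georgia"]   # placeholder subset of the 20 styles/fonts

def case_variants(word):
    # All uppercase, first letter uppercase only, all lowercase.
    return [word.upper(), word.capitalize(), word.lower()]

def render_synthetic_words(lexicon, out_dir="synth"):
    os.makedirs(out_dir, exist_ok=True)
    for word in lexicon:
        for variant in case_variants(word):
            for font in FONTS:
                out = os.path.join(out_dir, f"{word}_{variant}_{font}.png")
                # Render the word on a white background with a five-pixel border.
                subprocess.run([
                    "convert", "-background", "white", "-fill", "black",
                    "-font", font, "-pointsize", "64",
                    "-bordercolor", "white", "-border", "5",
                    f"label:{variant}", out,
                ], check=True)
```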
Preprocessing. Prior to feature computation, we resized all the word images to a width of 300 pixels, maintaining the original aspect ratio. We then applied the popular non-local means filter to smooth the scene text images. We also removed stray edge components containing fewer than 20 pixels. Empirically, we did not find this filtering step to be very critical to our approach.
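A rough equivalent of this preprocessing step is sketched below (Python/OpenCV). The filter strength, Canny thresholds, and the connected-component interpretation of "stray edge pixels" are assumptions of this sketch.

```python
import cv2

def preprocess(image, target_width=300, min_edge_pixels=20):
    """image: BGR scene text word image."""
    # Resize to a fixed width of 300 pixels, keeping the aspect ratio.
    h, w = image.shape[:2]
    image = cv2.resize(image, (target_width, int(h * target_width / w)))

    # Non-local means smoothing of the scene text image.
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    gray = cv2.fastNlMeansDenoising(gray, None, 10, 7, 21)

    # Drop stray edge components with fewer than 20 pixels.
    edges = cv2.Canny(gray, 50, 150)
    num, labels, stats, _ = cv2.connectedComponentsWithStats(edges, connectivity=8)
    for i in range(1, num):
        if stats[i, cv2.CC_STAT_AREA] < min_edge_pixels:
            edges[labels == i] = 0
    return gray, edges
```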
Features. We used vertical strips of width 4 pixels and a 2-pixel horizontal shift to extract the histogram of gradient orientation features. We computed the signed gradient orientation in this step. Each vertical strip was represented with a histogram of 9 bins. We evaluated the performance of these features in Table I, in comparison with that of the profile features used in [16]. Profile features consist of: (1) the projection profile, which counts the number of black pixels in each column; (2) the upper and lower profiles, which measure the number of background pixels between the word and the word boundary; and (3) the transition profile, which counts the number of text-to-background transitions per column. We used the binarization method in [10] prior to computing the profile features. Profile features have shown noteworthy performance on tasks such as handwritten and printed word spotting, but fail to cope with the additional complexities in scene text (e.g., low contrast, noise, blur, large intra-class variations). In fact, our results show that gradient features substantially outperform profile-based features for scene text recognition.
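For reference, a minimal sketch of these profile features is shown below (Python/NumPy), assuming a binarized image where text pixels are 1 and background pixels are 0; the exact conventions of [16] may differ.

```python
import numpy as np

def profile_features(binary):
    """binary: (H, W) array, 1 for text pixels, 0 for background."""
    H, W = binary.shape
    projection = binary.sum(axis=0).astype(float)         # text pixels per column

    upper = np.full(W, H, dtype=float)                    # distance from top to first text pixel
    lower = np.full(W, H, dtype=float)                    # distance from bottom to last text pixel
    for col in range(W):
        rows = np.flatnonzero(binary[:, col])
        if rows.size:
            upper[col] = rows[0]
            lower[col] = H - 1 - rows[-1]

    # Text/background transitions per column.
    transitions = np.abs(np.diff(binary, axis=0)).sum(axis=0).astype(float)
    return np.stack([projection, upper, lower, transitions], axis=1)
```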
Weighted Dynamic Time Warping. In our experiments we used 30 clusters to compute the weights. Our analysis comparing various methods is shown in Table I. We observe that with wDTW we achieve a high recognition accuracy on both datasets.
Dynamic k-Nearest Neighbor. Given a scene text image to recognize, we retrieve word images from the database of synthetic words. The retrievals are ranked by similarity score; in other words, synthetic words more similar to the scene text word get a higher rank. We use dynamic k-NN with an initial value of k = 3 for all the experiments.
We estimate all the parameters on the training sets of the respective datasets. Code for synthetic image generation and feature computation will be made available on our project page.
C. Comparison with Previous Work
We retrieve synthetic word images corresponding to the lexicon words and use dynamic k-NN to assign a text label to a given scene text image. We compare our method with the most recent previous works related to this task, and also with the commercial OCR ABBYY, in Table II. From the results, we see that the proposed holistic word matching scheme outperforms not only our earlier work [12], but also many recent works such as [14], [18], [19] on the SVT dataset. On the ICDAR dataset, we perform better than almost all the methods, except [19]. This marginally inferior performance (of about 0.3%) is mainly because our synthetic database fails to model a few of the fonts in the ICDAR dataset (Fig. 5). These types of fonts are rare in street view images. Specific preprocessing or more variations in the synthetic dataset may be needed to deal with such fonts. Fig. 4 shows the qualitative performance of the proposed method on sample images. We observe that the proposed method is robust to noise, blur, low contrast and background variations.
In addition to being simple, our method significantly improves over the prior art. This gain in accuracy can be attributed to the robustness of our method, which (i) does not rely on character segmentation but instead performs holistic word recognition; and (ii) learns the discriminativeness of features in a principled way and uses this information for robust matching with wDTW.
