Supervised Hashing for Image Retrieval via Image Representation Learning - Notes 1

Abstract
Background:
     In existing supervised hashing methods for images, an input image is usually encoded by a vector of hand-crafted visual features. Such hand-crafted feature vectors do not necessarily preserve the accurate semantic similarities of image pairs, which may often degrade the performance of hash function learning.

In this paper:
     We propose a supervised hashing method for image retrieval, in which we automatically learn a good image representation tailored to hashing as well as a set of hash functions. The proposed method has two stages. In the first stage, given the pairwise similarity matrix S over training images, we propose a scalable coordinate descent method to decompose S into a product H Hᵀ, where H is a matrix with each of its rows being the approximate hash code associated to a training image. In the second stage, we propose to simultaneously learn a good feature representation for the input images as well as a set of hash functions, via a deep convolutional network tailored to the learned hash codes in H and optionally the discrete class labels of the images.

Introduction
The learning-based hashing methods can be divided into three main streams:
(a) unsupervised methods, in which only unlabeled data is used to learn hash functions;
(b) and (c) the other two streams, semi-supervised and supervised methods.

Key question:
     How to encode images into a useful feature representation so as to enhance the hashing performance.
     Ideally, one would like to automatically learn such a feature representation that sufficiently preserves the semantic similarities for images during the hash learning process.
     e.g. Without using hand-crafted visual features, Semantic Hashing (Salakhutdinov and Hinton 2007) is a hashing method which automatically constructs binary-code feature representation for images by a multi-layer auto-encoder, with the raw pixels of images being directly used as input.
     However, semantic hashing imposes a difficult optimization problem.

     In this paper, we propose a supervised hashing method for image retrieval which simultaneously learns a set of hash functions as well as a useful image representation tailored to the hashing task.
     Given n images I = {I1, I2, ..., In} and a pairwise similarity matrix S in which S_ij = 1 if Ii and Ij are semantically similar and otherwise S_ij = −1, the task of supervised hashing is to learn a set of q hash functions based on S and I.
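     When the pairwise labels are derived from per-image class labels (the setting assumed later in Stage 2), S can be built directly. A minimal numpy sketch (the helper name build_similarity_matrix is mine, not from the paper):

    import numpy as np

    # Build the pairwise similarity matrix S from per-image class labels:
    # S[i, j] = 1 if images i and j share a class label, and -1 otherwise.
    def build_similarity_matrix(labels):
        labels = np.asarray(labels)
        return np.where(labels[:, None] == labels[None, :], 1, -1)

    # Example: 4 images with class labels 0, 1, 1, 2
    S = build_similarity_matrix([0, 1, 1, 2])
    print(S)
    # [[ 1 -1 -1 -1]
    #  [-1  1  1 -1]
    #  [-1  1  1 -1]
    #  [-1 -1 -1  1]]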
 
     Formulating the hash learning task as a single optimization problem usually leads to a complex and highly non-convex objective which may be difficult to optimize. To avoid this issue, one of the popular ways is decomposing the learning process into a hash code learning stage followed by a hash function learning stage (e.g., (Zhang et al. 2010; Lin et al. 2013)).
     As shown in Figure 1, the proposed method also adopts such a two-stage paradigm. In the first stage, we propose a scalable coordinate descent algorithm to approximately decompose S into a product form S ≈ (1/q) H Hᵀ, where H ∈ R^{n×q} with each of its elements being in {−1, 1}. The k-th row of H is regarded as the approximate target hash code of the image Ik. In the second stage, we simultaneously learn a set of q hash functions and a feature representation for the images in I by deep convolutional neural networks.

Related Work
LSH:
     The early research of hashing focuses on data-independent methods, in which the Locality Sensitive Hashing (LSH) methods (Gionis, Indyk, and Motwani 1999; Charikar 2002) are the most well-known representatives. LSH methods use simple random projections as hash functions, which are independent of the data. However, LSH methods require long hash codes to achieve satisfactory accuracy, which leads to larger storage costs and lower recall.
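     As a concrete illustration of such data-independent hashing, here is a minimal random-hyperplane sketch in the spirit of Charikar (2002); the dimensions and names below are illustrative choices, not from the paper:

    import numpy as np

    # Random-hyperplane LSH: each hash bit is the sign of a random
    # projection, chosen independently of the data.
    rng = np.random.default_rng(0)
    d, q = 512, 48                    # feature dimension, code length
    W = rng.standard_normal((d, q))   # q random hyperplane normals

    def lsh_codes(X):
        """Map feature vectors X (n x d) to binary codes in {-1, +1}^(n x q)."""
        return np.where(X @ W >= 0, 1, -1)

    X = rng.standard_normal((5, d))   # 5 dummy feature vectors
    print(lsh_codes(X).shape)         # (5, 48)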

The Approach
Stage 1: learning approximate hash codes
      We define an n-by-q binary matrix H whose k-th row H_k· ∈ {−1, 1}^q represents the target q-bit hash code for the image Ik. The goal of supervised hashing is to generate hash codes that preserve the semantic similarities over image pairs. Specifically, the Hamming distance between two hash codes H_i· and H_j· (associated to Ii and Ij, respectively) is expected to be correlated with S_ij, which indicates the semantic similarity of Ii and Ij. Existing studies (Liu et al. 2012) have pointed out that the code inner product H_i· H_j·ᵀ has a one-to-one correspondence to the Hamming distance between H_i· and H_j·. Since H_i· ∈ {−1, 1}^q, the code inner product H_i· H_j·ᵀ is in the range [−q, q]. Hence, as shown in Figure 1 (Stage 1), we learn the approximate hash codes for the training images in I by minimizing the following reconstruction errors:
         min_H  ||S − (1/q) H Hᵀ||_F²    s.t.  H ∈ {−1, 1}^{n×q}        (2)
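     As a quick sanity check of the inner-product/Hamming correspondence above (an illustrative snippet, not code from the paper):

    import numpy as np

    # For codes a, b in {-1, +1}^q, matching bits contribute +1 to the inner
    # product and differing bits contribute -1, so a.b = q - 2 * hamming(a, b),
    # i.e. hamming(a, b) = (q - a.b) / 2, and the inner product lies in [-q, q].
    q = 8
    rng = np.random.default_rng(1)
    a = rng.choice([-1, 1], size=q)
    b = rng.choice([-1, 1], size=q)
    assert np.sum(a != b) == (q - a @ b) / 2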
     Equation (2) is difficult to optimize directly, so the paper relaxes it into Equation (3), the key change being that the entries of H are relaxed from {−1, 1} to the interval [−1, 1]:

         min_H  ||S − (1/q) H Hᵀ||_F²    s.t.  H ∈ [−1, 1]^{n×q}        (3)

     Since Equation (3) is still a non-convex problem, we propose to solve it by a coordinate descent algorithm using Newton directions, which sequentially or randomly selects one entry of H to update at a time.
     The remaining algorithmic details are omitted here; note that methods proposed from 2015 onward have abandoned this first-stage step. A simplified optimization sketch follows below.
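     The paper's Stage-1 solver (entry-wise coordinate descent with Newton directions) is fairly involved. As a simplified stand-in that conveys the same idea, here is a projected-gradient sketch for the relaxed problem (3); this is my own simplification, not the paper's algorithm, and the step size and iteration count are arbitrary:

    import numpy as np

    # Minimize ||S - (1/q) H H^T||_F^2 over H in [-1, 1]^{n x q} by projected
    # gradient descent, then binarize to get approximate target hash codes.
    def learn_approx_codes(S, q, iters=1000, lr=0.01, seed=0):
        n = S.shape[0]
        rng = np.random.default_rng(seed)
        H = rng.uniform(-1, 1, size=(n, q))
        for _ in range(iters):
            R = S - (H @ H.T) / q                  # residual of the factorization
            grad = -(4.0 / q) * (R @ H)            # gradient of the Frobenius loss
            H = np.clip(H - lr * grad, -1.0, 1.0)  # project back onto [-1, 1]
        return np.where(H >= 0, 1, -1)             # binarized approximate codes

    # Toy usage: 4 images, 8-bit codes, S built from class labels [0, 1, 1, 2]
    labels = np.array([0, 1, 1, 2])
    S = np.where(labels[:, None] == labels[None, :], 1, -1).astype(float)
    H = learn_approx_codes(S, q=8)
    print((H @ H.T) / 8)  # should roughly reconstruct S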

Stage 2: learning image feature representation and hash functions
     The main task is to learn a feature representation for the training images as well as a set of hash functions.
     In many scenarios of supervised hashing for images, the similarity/dissimilarity labels on image pairs are derived from the discrete class labels of the individual images. That is, in hash learning, the discrete class labels of the training images are usually available. Here we can design the output layer of our network in two ways, depending on whether the discrete class labels of the training images are available. In the first way, given only the learned hash code matrix H with each of its rows being a q-bit hash code for a training image, we define an output layer with q output units (the red nodes in the output layer in Figure 1 (Stage 2)), each of which corresponds to one bit in the target hash code for an image. We denote the proposed hashing method using a CNN with such an output layer (with only the red nodes, ignoring the black nodes and the associated lines) as CNNH.
     In the second way, we assume the discrete class labels of the training images are available. Specifically, for n training images in c classes (an image may belong to multiple classes), we define an n-by-c discrete label matrix Y ∈ {0, 1}^{n×c}, where Yij = 1 if the i-th training image belongs to the j-th class, otherwise Yij = 0. For the output layer in our network, in addition to defining the q output units (the red nodes in the output layer) corresponding to the hash bits as in the first way, we add c output units (the black nodes in the output layer in Figure 1 (Stage 2)) which correspond to the class labels of a training image. By incorporating the image class labels as a part of the output layer, we enforce the network to learn a shared image representation which matches both the approximate hash codes and the image class labels. It can be regarded as a transfer learning case in which the incorporated image class labels are expected to be helpful for learning a more accurate image representation (i.e. the hidden units in the fully connected layer). Such a better image representation may be advantageous for hash function learning. We denote the proposed method using a CNN with the output layer having both the red nodes and the black nodes as CNNH+.
     For example, with n = 4 images and c = 3 classes, the label matrix Y might look like:
            class 1   class 2   class 3
     I1        0         1         0
     I2        1         0         0
     I3        1         0         0
     I4        0         0         1
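     To make the two output-layer designs concrete, here is a minimal PyTorch-style sketch; the layer sizes and names are my own and do not reproduce the paper's exact architecture, and the tanh squashing to [−1, 1] is likewise an assumption. With use_labels=False it corresponds to CNNH (hash-bit outputs only); with use_labels=True it corresponds to CNNH+ (hash bits plus class-label outputs on a shared representation):

    import torch
    import torch.nn as nn

    class CNNHPlus(nn.Module):
        """Shared conv trunk with q hash-bit outputs ("red nodes") and,
        optionally, c class-label outputs ("black nodes")."""
        def __init__(self, q=48, c=10, use_labels=True):
            super().__init__()
            self.trunk = nn.Sequential(
                nn.Conv2d(3, 32, 5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(32, 64, 5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
                nn.Flatten(),
                nn.Linear(64 * 8 * 8, 512), nn.ReLU(),  # shared representation
            )
            self.hash_head = nn.Linear(512, q)           # one unit per hash bit
            self.label_head = nn.Linear(512, c) if use_labels else None

        def forward(self, x):
            z = self.trunk(x)
            bits = torch.tanh(self.hash_head(z))  # in [-1, 1], matching codes in H
            labels = self.label_head(z) if self.label_head is not None else None
            return bits, labels

    # Toy forward pass on 32x32 RGB inputs (CIFAR-10-sized images)
    net = CNNHPlus(q=48, c=10)
    bits, labels = net(torch.randn(2, 3, 32, 32))
    print(bits.shape, labels.shape)  # torch.Size([2, 48]) torch.Size([2, 10])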

CSDN blog: http://blog.csdn.net/liyaohhh/article/details/51853385
Slides: http://ss.sysu.edu.cn/~py/CNNH-slides.pdf

Paper: http://ss.sysu.edu.cn/~py/papers/AAAI-CNNH.pdf












