Image Captioning (Part 1)


Outline:

The overall content has four parts:

  1. Training and evaluation data: we use the Microsoft COCO image dataset; its format and processing are briefly introduced below.
  2. The main thread of the topic follows the development of Show and Tell.
  3. Various attention mechanisms and other encoder enhancements.
  4. Evaluation methods and a discussion of applications.

Notes

  1. The code uses TensorFlow; upgrade to 1.4 if you can, since many ops in older versions do not support GPU execution.
  2. Basics such as SGD, RNN, and CNN are deliberately left out, otherwise the overall thread would become unclear.
  3. We try to focus on the classic papers.
  4. Readers with an image-processing background can skip the "image process" section below.

Paper roadmap:

data: Microsoft COCO: Common Objects in Context
2015: Show and tell: A neural image caption generator
2015: Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
2016: SPICE: Semantic Propositional Image Caption Evaluation
2017: Knowing When to Look: Adaptive Attention via A Visual Sentinel for Image Captioning

Links are given at the end of the article.

image process

Data source: Microsoft COCO

train/val image:

http://msvocds.blob.core.windows.net/coco2014/train2014.zip
http://msvocds.blob.core.windows.net/coco2014/val2014.zip

train contains about 80k images; val contains about 40k.

train/val caption:

http://msvocds.blob.core.windows.net/annotations-1-0-3/annotations/captions_train2014.json
http://msvocds.blob.core.windows.net/annotations-1-0-3/annotations/captions_val2014.json

Both are files in JSON format.

Data format

The annotations provided by this dataset cover image classification, region localization, semantic segmentation, captions, and more; here we only care about the captions.

The dataset consists of two parts, the images and the JSON files; the JSON files link each image to its caption descriptions.

images
For example, consider an image of giraffes eating tree leaves:

[Image: giraffes eating leaves from a tree]

The raw metadata of the image:

{
  "license": 5,
  "file_name": "COCO_train2014_000000057870.jpg",
  "coco_url": "http://mscoco.org/images/57870",
  "height": 480,
  "width": 640,
  "date_captured": "2013-11-14 16:28:13",
  "flickr_url": "http://farm4.staticflickr.com/3153/2970773875_164f0c0b83_z.jpg",
  "id": 57870
}
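As a quick illustration (pure Python, using the record above), the standard json module parses this into a dict; the `id` field is what later joins the image to its captions, and `file_name` locates the image on disk:

```python
import json

# The raw image record shown above.
raw = ('{"license": 5, "file_name": "COCO_train2014_000000057870.jpg", '
       '"coco_url": "http://mscoco.org/images/57870", "height": 480, '
       '"width": 640, "date_captured": "2013-11-14 16:28:13", '
       '"flickr_url": "http://farm4.staticflickr.com/3153/2970773875_164f0c0b83_z.jpg", '
       '"id": 57870}')

record = json.loads(raw)

# "id" joins this record to its captions; "file_name" locates the image on disk.
print(record["id"])         # 57870
print(record["file_name"])  # COCO_train2014_000000057870.jpg
print(record["height"], record["width"])  # 480 640
```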

Example code for reading, decoding, resizing, cropping, and padding the image:

```python
import tensorflow as tf

# `file` is the path to the JPEG image; log_tensor_shap is the author's
# helper that prints the tensor and its evaluated shape (see process.py
# in Appendix 1).

g = tf.Graph()
with g.as_default():
  # read the raw JPEG bytes ('rb', since the file is binary)
  with tf.gfile.GFile(file, 'rb') as f:
    encoded_image = f.read()

  # decode image
  image = tf.image.decode_jpeg(encoded_image, channels=3)

  with tf.Session(graph=g) as sess:
    # Tensor("DecodeJpeg:0", shape=(?, ?, 3), dtype=uint8)
    # shape: [640 591 3]
    log_tensor_shap(image, sess)

    image = tf.image.resize_images(
        image,
        size=[346, 346],
        method=tf.image.ResizeMethod.BILINEAR)
    # Tensor("Squeeze:0", shape=(346, 346, 3), dtype=float32)
    # shape: [346 346   3]
    log_tensor_shap(image, sess)

    image = tf.image.resize_image_with_crop_or_pad(image, 269, 269)
    # Tensor("Squeeze_1:0", shape=(269, 269, 3), dtype=float32)
    # shape: [269 269   3]
    log_tensor_shap(image, sess)

    # map elements from [0, 1] to [-1, 1]
    image = tf.subtract(image, 0.5)
    image = tf.multiply(image, 2.0)
```

annotations
Mainly contains image_id and the caption text.
Several English captions for the image above:

A giraffe eating food from the top of the tree.
A giraffe standing up nearby a tree
A giraffe mother with its baby in the forest.
Two giraffes standing in a tree filled area.
A giraffe standing next to a forest filled with trees.

Data format:

{'image_id': 318556, 'id': 48, 'caption': 'giraffe eating food from the top of the tree'}

Example code for linking captions to their images:

```python
import json

import tensorflow as tf

with tf.gfile.GFile(file, 'r') as f:
  caption_data = json.load(f)

  id_to_filename = dict([(x["id"], x["file_name"])
                         for x in caption_data["images"]])

  # Extract the captions.
  # Each image_id is associated with multiple captions.
  id_to_captions = {}
  for annotation in caption_data["annotations"]:
    image_id = annotation["image_id"]
    caption = annotation["caption"]
    id_to_captions.setdefault(image_id, [])
    id_to_captions[image_id].append(caption)
```
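Joining the two dicts on image id yields the (filename, caption) training pairs. A minimal sketch with toy records in the same shape as the COCO annotations JSON (the data here is illustrative, not taken from the real files):

```python
# Toy data in the same shape as the COCO annotations JSON (illustrative only).
caption_data = {
    "images": [
        {"id": 318556, "file_name": "COCO_train2014_000000318556.jpg"},
    ],
    "annotations": [
        {"image_id": 318556, "id": 48,
         "caption": "giraffe eating food from the top of the tree"},
        {"image_id": 318556, "id": 49,
         "caption": "A giraffe standing up nearby a tree"},
    ],
}

id_to_filename = {x["id"]: x["file_name"] for x in caption_data["images"]}

id_to_captions = {}
for annotation in caption_data["annotations"]:
    id_to_captions.setdefault(annotation["image_id"], []).append(
        annotation["caption"])

# Join on image id to get (filename, caption) training pairs.
pairs = [(id_to_filename[image_id], caption)
         for image_id, captions in id_to_captions.items()
         for caption in captions]

print(len(pairs))   # 2
print(pairs[0][0])  # COCO_train2014_000000318556.jpg
```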

A few caveats:

  1. The demo here is kept simple: we read images with the most basic serialization approach. In real training, the images live on disk and cannot all be loaded into memory, so they are loaded lazily while training proceeds. For a more realistic preprocessing and serialization pipeline, see Google's im2txt build_mscoco_data.py (Appendix 1).
  2. The text goes through an embedding layer at the start of training, so every word must be converted to an integer id; to save space in the embedding_lookup table, the ids should be dense and contiguous.
  3. So that the demo (Appendix 1) can run, I uploaded a small subset of the data; if you later want to train your own model, download the full data from the links above.
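For point 2, here is a minimal sketch of building a vocabulary with dense, contiguous ids (the tokenized captions and counting scheme are my own illustrative choices, not taken from the demo code):

```python
from collections import Counter

# Toy corpus of tokenized captions (illustrative only).
captions = [
    ["a", "giraffe", "eating", "leaves"],
    ["a", "giraffe", "standing", "nearby", "a", "tree"],
]

# Count word frequencies and assign dense, contiguous ids starting at 0,
# most frequent words first, so the embedding_lookup table wastes no rows.
counts = Counter(word for caption in captions for word in caption)
vocab = {word: idx for idx, (word, _) in enumerate(counts.most_common())}

# Map a caption to its id sequence for the embedding layer.
ids = [vocab[word] for word in captions[0]]
print(vocab["a"])  # 0  (the most frequent word gets the smallest id)
print(ids)
```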

Two recommendations:
1. https://www.semanticscholar.org is a paper-browsing site that tracks each paper's references, citations, and their popularity; very handy for tracing the lineage of a topic.
2. http://md.barretlee.com/ is a formatting tool for WeChat code posts; articles laid out with it are at least somewhat readable.

Appendix 1:
Code (process.py): https://github.com/Wenda-Jiang/nlp_demo/blob/master/process/process.py
Code (build_mscoco_data.py): https://github.com/tensorflow/models/blob/master/research/im2txt/im2txt/data/build_mscoco_data.py

Appendix 2:
+ data: Microsoft COCO: Common Objects in Context
(https://www.semanticscholar.org/paper/Microsoft-COCO-Common-Objects-in-Context-Lin-Maire/5e0f8c355a37a5a89351c02f174e7a5ddcb98683)
+ 2015: Show and tell: A neural image caption generator
(https://www.semanticscholar.org/paper/Show-and-tell-A-neural-image-caption-generator-Vinyals-Toshev/11da2d589485685f792a8ac79d4c2e589e5f77bd)
+ 2015: Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
(https://www.semanticscholar.org/paper/Show-Attend-and-Tell-Neural-Image-Caption-Generati-Xu-Ba/146f6f6ed688c905fb6e346ad02332efd5464616)
+ 2016: SPICE: Semantic Propositional Image Caption Evaluation
(https://www.semanticscholar.org/paper/SPICE-Semantic-Propositional-Image-Caption-Evaluat-Anderson-Fernando/4a20c069a98e13f4bb93f27a829a87a75925d0e5)
+ 2017: Knowing When to Look: Adaptive Attention via A Visual Sentinel for Image Captioning
(https://www.semanticscholar.org/paper/Knowing-When-to-Look-Adaptive-Attention-via-A-Visu-Lu-Xiong/8d42013abfcec7e7b8993106d32c84d2645c2cd9)
