Image Captioning (Part 1)


Outline:

The overall content has four parts:

  1. Training and evaluation data: we use the Microsoft COCO image dataset; its format and processing are briefly introduced below.
  2. The main thread of the topic follows the development of Show and Tell.
  3. Various attention mechanisms and other encoder enhancements.
  4. Evaluation methods and a discussion of applications.

Notes

  1. The code uses TensorFlow; upgrade to 1.4 if you can, since many ops in older versions do not support GPU execution.
  2. Basics such as SGD, RNN, and CNN are deliberately left out, otherwise the overall thread would become unclear.
  3. We try to focus on the classic papers.
  4. Readers with an image-processing background can skip the "image process" section below.

Paper roadmap:

data: Microsoft COCO: Common Objects in Context
2015: Show and tell: A neural image caption generator
2015: Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
2016: SPICE: Semantic Propositional Image Caption Evaluation
2017: Knowing When to Look: Adaptive Attention via A Visual Sentinel for Image Captioning

Links are given at the end of the article.

image process

Data source: Microsoft COCO

train/val image:

http://msvocds.blob.core.windows.net/coco2014/train2014.zip
http://msvocds.blob.core.windows.net/coco2014/val2014.zip

train contains about 80k images; val contains about 40k.

train/val caption:

http://msvocds.blob.core.windows.net/annotations-1-0-3/annotations/captions_train2014.json
http://msvocds.blob.core.windows.net/annotations-1-0-3/annotations/captions_val2014.json

Both are files in JSON format.

Data format

The annotations provided by this dataset cover image classification, region localization, semantic segmentation, captions, and more; here we only care about the captions.

The dataset consists of two parts, the images and the JSON files; the JSON files link each image to its caption descriptions.

images
For example, consider an image of giraffes eating tree leaves:

[Image: giraffes eating leaves from a tree]

The raw metadata of the image:

{
  "license": 5,
  "file_name": "COCO_train2014_000000057870.jpg",
  "coco_url": "http://mscoco.org/images/57870",
  "height": 480,
  "width": 640,
  "date_captured": "2013-11-14 16:28:13",
  "flickr_url": "http://farm4.staticflickr.com/3153/2970773875_164f0c0b83_z.jpg",
  "id": 57870
}
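As a quick illustration (pure Python, using the record above), the standard json module parses this into a dict; the `id` field is what later joins the image to its captions, and `file_name` locates the image on disk:

```python
import json

# The raw image record shown above.
raw = ('{"license": 5, "file_name": "COCO_train2014_000000057870.jpg", '
       '"coco_url": "http://mscoco.org/images/57870", "height": 480, '
       '"width": 640, "date_captured": "2013-11-14 16:28:13", '
       '"flickr_url": "http://farm4.staticflickr.com/3153/2970773875_164f0c0b83_z.jpg", '
       '"id": 57870}')

record = json.loads(raw)

# "id" joins this record to its captions; "file_name" locates the image on disk.
print(record["id"])         # 57870
print(record["file_name"])  # COCO_train2014_000000057870.jpg
print(record["height"], record["width"])  # 480 640
```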

Example code for reading, decoding, resizing, cropping, and padding the image:

```python
import tensorflow as tf

# `file` is the path to the JPEG image; log_tensor_shap is the author's
# helper that prints the tensor and its evaluated shape (see process.py
# in Appendix 1).

g = tf.Graph()
with g.as_default():
  # read the raw JPEG bytes ('rb', since the file is binary)
  with tf.gfile.GFile(file, 'rb') as f:
    encoded_image = f.read()

  # decode image
  image = tf.image.decode_jpeg(encoded_image, channels=3)

  with tf.Session(graph=g) as sess:
    # Tensor("DecodeJpeg:0", shape=(?, ?, 3), dtype=uint8)
    # shape: [640 591 3]
    log_tensor_shap(image, sess)

    image = tf.image.resize_images(
        image,
        size=[346, 346],
        method=tf.image.ResizeMethod.BILINEAR)
    # Tensor("Squeeze:0", shape=(346, 346, 3), dtype=float32)
    # shape: [346 346   3]
    log_tensor_shap(image, sess)

    image = tf.image.resize_image_with_crop_or_pad(image, 269, 269)
    # Tensor("Squeeze_1:0", shape=(269, 269, 3), dtype=float32)
    # shape: [269 269   3]
    log_tensor_shap(image, sess)

    # map elements from [0, 1] to [-1, 1]
    image = tf.subtract(image, 0.5)
    image = tf.multiply(image, 2.0)
```

annotations
Mainly contains image_id and the caption text.
Several English captions for the image above:

A giraffe eating food from the top of the tree.
A giraffe standing up nearby a tree
A giraffe mother with its baby in the forest.
Two giraffes standing in a tree filled area.
A giraffe standing next to a forest filled with trees.

Data format:

{'image_id': 318556, 'id': 48, 'caption': 'giraffe eating food from the top of the tree'}

Example code for linking captions to their images:

```python
import json

import tensorflow as tf

with tf.gfile.GFile(file, 'r') as f:
  caption_data = json.load(f)

  id_to_filename = dict([(x["id"], x["file_name"])
                         for x in caption_data["images"]])

  # Extract the captions.
  # Each image_id is associated with multiple captions.
  id_to_captions = {}
  for annotation in caption_data["annotations"]:
    image_id = annotation["image_id"]
    caption = annotation["caption"]
    id_to_captions.setdefault(image_id, [])
    id_to_captions[image_id].append(caption)
```
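Joining the two dicts on image id yields the (filename, caption) training pairs. A minimal sketch with toy records in the same shape as the COCO annotations JSON (the data here is illustrative, not taken from the real files):

```python
# Toy data in the same shape as the COCO annotations JSON (illustrative only).
caption_data = {
    "images": [
        {"id": 318556, "file_name": "COCO_train2014_000000318556.jpg"},
    ],
    "annotations": [
        {"image_id": 318556, "id": 48,
         "caption": "giraffe eating food from the top of the tree"},
        {"image_id": 318556, "id": 49,
         "caption": "A giraffe standing up nearby a tree"},
    ],
}

id_to_filename = {x["id"]: x["file_name"] for x in caption_data["images"]}

id_to_captions = {}
for annotation in caption_data["annotations"]:
    id_to_captions.setdefault(annotation["image_id"], []).append(
        annotation["caption"])

# Join on image id to get (filename, caption) training pairs.
pairs = [(id_to_filename[image_id], caption)
         for image_id, captions in id_to_captions.items()
         for caption in captions]

print(len(pairs))   # 2
print(pairs[0][0])  # COCO_train2014_000000318556.jpg
```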

A few caveats:

  1. The demo here is kept simple: we read images with the most basic serialization approach. In real training, the images live on disk and cannot all be loaded into memory, so they are loaded lazily while training proceeds. For a more realistic preprocessing and serialization pipeline, see Google's im2txt build_mscoco_data.py (Appendix 1).
  2. The text goes through an embedding layer at the start of training, so every word must be converted to an integer id; to save space in the embedding_lookup table, the ids should be dense and contiguous.
  3. So that the demo (Appendix 1) can run, I uploaded a small subset of the data; if you later want to train your own model, download the full data from the links above.
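For point 2, here is a minimal sketch of building a vocabulary with dense, contiguous ids (the tokenized captions and counting scheme are my own illustrative choices, not taken from the demo code):

```python
from collections import Counter

# Toy corpus of tokenized captions (illustrative only).
captions = [
    ["a", "giraffe", "eating", "leaves"],
    ["a", "giraffe", "standing", "nearby", "a", "tree"],
]

# Count word frequencies and assign dense, contiguous ids starting at 0,
# most frequent words first, so the embedding_lookup table wastes no rows.
counts = Counter(word for caption in captions for word in caption)
vocab = {word: idx for idx, (word, _) in enumerate(counts.most_common())}

# Map a caption to its id sequence for the embedding layer.
ids = [vocab[word] for word in captions[0]]
print(vocab["a"])  # 0  (the most frequent word gets the smallest id)
print(ids)
```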

Two recommendations:
1. https://www.semanticscholar.org is a paper-browsing site that tracks each paper's references, citations, and their popularity; very handy for tracing the lineage of a topic.
2. http://md.barretlee.com/ is a formatting tool for WeChat code posts; articles laid out with it are at least somewhat readable.

Appendix 1:
Code (process.py): https://github.com/Wenda-Jiang/nlp_demo/blob/master/process/process.py
Code (build_mscoco_data.py): https://github.com/tensorflow/models/blob/master/research/im2txt/im2txt/data/build_mscoco_data.py

Appendix 2:
+ data: Microsoft COCO: Common Objects in Context
(https://www.semanticscholar.org/paper/Microsoft-COCO-Common-Objects-in-Context-Lin-Maire/5e0f8c355a37a5a89351c02f174e7a5ddcb98683)
+ 2015: Show and tell: A neural image caption generator
(https://www.semanticscholar.org/paper/Show-and-tell-A-neural-image-caption-generator-Vinyals-Toshev/11da2d589485685f792a8ac79d4c2e589e5f77bd)
+ 2015: Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
(https://www.semanticscholar.org/paper/Show-Attend-and-Tell-Neural-Image-Caption-Generati-Xu-Ba/146f6f6ed688c905fb6e346ad02332efd5464616)
+ 2016: SPICE: Semantic Propositional Image Caption Evaluation
(https://www.semanticscholar.org/paper/SPICE-Semantic-Propositional-Image-Caption-Evaluat-Anderson-Fernando/4a20c069a98e13f4bb93f27a829a87a75925d0e5)
+ 2017: Knowing When to Look: Adaptive Attention via A Visual Sentinel for Image Captioning
(https://www.semanticscholar.org/paper/Knowing-When-to-Look-Adaptive-Attention-via-A-Visu-Lu-Xiong/8d42013abfcec7e7b8993106d32c84d2645c2cd9)
