Image Captioning (Part 1)
Source: Internet | Editor: 程序博客网 | Time: 2024/06/08 17:12
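Plan:
The series has four parts:
- First, the training and evaluation data. We use the Microsoft COCO image dataset; its format and preprocessing are briefly covered later in this post.
- The main thread of the series follows the development of Show and Tell.
- Attention mechanisms and other encoder improvements that build on it.
- Evaluation methods and a discussion of applications.
Notes
- The code uses TensorFlow, preferably upgraded to 1.4; in earlier versions many ops do not support GPU execution.
- Basic concepts such as SGD, RNN, and CNN are mostly skipped, since covering them would blur the main thread.
- The focus is on the more classic papers.
- Readers with an image-processing background can skip the "image process" section later in this post.
Paper roadmap: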
data: Microsoft COCO: Common Objects in Context
2015: Show and tell: A neural image caption generator
2015: Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
2016: SPICE: Semantic Propositional Image Caption Evaluation
2017: Knowing When to Look: Adaptive Attention via A Visual Sentinel for Image Captioning
Links are at the end of the post.
image process
Data source: Microsoft COCO
train/val image:
http://msvocds.blob.core.windows.net/coco2014/train2014.zip
http://msvocds.blob.core.windows.net/coco2014/val2014.zip
The train set has roughly 80,000 images and the val set roughly 40,000.
train/val caption:
http://msvocds.blob.core.windows.net/annotations-1-0-3/annotations/captions_train2014.json
http://msvocds.blob.core.windows.net/annotations-1-0-3/annotations/captions_val2014.json
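Both caption files are in JSON format.
Data format
The annotations provided with this dataset cover image classification, region localization, semantic segmentation, captions, and more. Here we only care about the captions.
The dataset consists of two kinds of data: the images and the JSON files. The JSON files link each image to its caption descriptions.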
images
For example, consider an image of a giraffe eating leaves:

The raw metadata for the image:
{"license": 5, "file_name": "COCO_train2014_000000057870.jpg", "coco_url": "http://mscoco.org/images/57870", "height": 480, "width": 640, "date_captured": "2013-11-14 16:28:13", "flickr_url": "http://farm4.staticflickr.com/3153/2970773875_164f0c0b83_z.jpg","id": 57870}
Example code for reading, decoding, resizing, cropping, and padding an image:
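import tensorflow as tf

# JPEG data is binary, so open the file in 'rb' mode.
with tf.gfile.GFile(file, 'rb') as f:
    encoded_image = f.read()

g = tf.Graph()
with g.as_default():
    # decode image
    image = tf.image.decode_jpeg(encoded_image, channels=3)

# log_tensor_shap is a helper from the demo code linked at the end of the post;
# judging by the comments below, it logs a tensor and its evaluated shape.
with tf.Session(graph=g) as sess:
    # Tensor("DecodeJpeg:0", shape=(?, ?, 3), dtype=uint8)
    # shape: [640 591 3]
    log_tensor_shap(image, sess)

    # Convert uint8 [0, 255] to float32 [0, 1] so the final rescaling is valid.
    image = tf.image.convert_image_dtype(image, dtype=tf.float32)
    image = tf.image.resize_images(
        image, size=[346, 346], method=tf.image.ResizeMethod.BILINEAR)
    # Tensor("Squeeze:0", shape=(346, 346, 3), dtype=float32)
    # shape: [346 346 3]
    log_tensor_shap(image, sess)

    image = tf.image.resize_image_with_crop_or_pad(image, 269, 269)
    # Tensor("Squeeze_1:0", shape=(269, 269, 3), dtype=float32)
    # shape: [269 269 3]
    log_tensor_shap(image, sess)

    # Rescale elements from [0, 1] to [-1, 1].
    image = tf.subtract(image, 0.5)
    image = tf.multiply(image, 2.0)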
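As a sanity check on the last two operations above, here is a minimal pure-Python sketch of the rescaling arithmetic. It assumes 8-bit pixel values are first brought into [0, 1] and only then mapped to [-1, 1]; without that first step, subtracting 0.5 and doubling would not land in [-1, 1]. The function name `rescale_pixel` is hypothetical and not part of the demo code.

```python
def rescale_pixel(value, max_value=255.0):
    """Map a pixel value in [0, max_value] to the range [-1, 1]."""
    unit = value / max_value   # [0, max_value] -> [0, 1]
    # Same arithmetic as subtract(0.5) followed by multiply(2.0).
    return (unit - 0.5) * 2.0  # [0, 1] -> [-1, 1]


if __name__ == "__main__":
    for v in (0, 64, 255):
        print(v, rescale_pixel(v))
```

Black (0) maps to -1 and white (255) maps to 1, so the network input is centered around zero.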
annotations
The main fields are image_id and caption (the description).
Several English captions for the image above:
A giraffe eating food from the top of the tree.
A giraffe standing up nearby a tree
A giraffe mother with its baby in the forest.
Two giraffes standing in a tree filled area.
A giraffe standing next to a forest filled with trees.
Data format:
{'image_id': 318556, 'id': 48, 'caption': 'giraffe eating food from the top of the tree'}
Example code for linking captions to images:
import json
import tensorflow as tf

with tf.gfile.GFile(file, 'r') as f:
    caption_data = json.load(f)

# Map each image id to its file name.
id_to_filename = {x["id"]: x["file_name"] for x in caption_data["images"]}

# Extract the captions.
# Each image_id is associated with multiple captions.
id_to_captions = {}
for annotation in caption_data["annotations"]:
    image_id = annotation["image_id"]
    caption = annotation["caption"]
    id_to_captions.setdefault(image_id, []).append(caption)
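Once the two dictionaries exist, one plausible next step is to flatten them into (file name, caption) training pairs. The sketch below is an illustration only: `build_pairs` is a hypothetical helper, and the toy dictionaries (including which captions belong to which id) are made up to mirror the shapes produced above.

```python
def build_pairs(id_to_filename, id_to_captions):
    """Return a list of (file_name, caption) pairs, one per caption."""
    pairs = []
    for image_id, file_name in sorted(id_to_filename.items()):
        # Images without any caption are simply skipped.
        for caption in id_to_captions.get(image_id, []):
            pairs.append((file_name, caption))
    return pairs


# Toy data in the same shape as the two dictionaries built above.
id_to_filename = {57870: "COCO_train2014_000000057870.jpg"}
id_to_captions = {
    57870: [
        "A giraffe eating food from the top of the tree.",
        "A giraffe standing up nearby a tree",
    ]
}
pairs = build_pairs(id_to_filename, id_to_captions)
```

Each image contributes as many pairs as it has captions, so the same file name appears several times in the result.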
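A few points to note:
- To keep the demo simple, images are read in the most basic way. In real training the images live on disk and cannot all be loaded into memory, so they are loaded as training proceeds. For a more complete loading pipeline, see the data preprocessing and serialization script build_mscoco_data.py from Google's im2txt (Appendix 1).
- Before training, the text passes through an embedding layer, so every word must be converted to an id. To save space in embedding_lookup, the ids should be dense and contiguous.
- So that the demo (Appendix 1) can run, I uploaded a small subset of the data. If you want to train your own model later, download the full data from the links above.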
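The point about dense, contiguous word ids can be sketched as follows. This is a minimal illustration and not the demo's actual vocabulary code: the helper names `build_vocab` and `encode`, the `<unk>` token, and the whitespace tokenization are all assumptions.

```python
from collections import Counter


def tokenize(caption):
    """Lowercase, drop a trailing period, and split on whitespace."""
    return caption.lower().rstrip(".").split()


def build_vocab(captions, min_count=1, unk_token="<unk>"):
    """Assign dense ids 0..N-1 to words; id 0 is reserved for unknown words."""
    counts = Counter()
    for caption in captions:
        counts.update(tokenize(caption))
    # Sort by descending frequency, then alphabetically, so ids are deterministic.
    words = sorted((w for w, c in counts.items() if c >= min_count),
                   key=lambda w: (-counts[w], w))
    vocab = {unk_token: 0}
    for word in words:
        vocab[word] = len(vocab)
    return vocab


def encode(caption, vocab, unk_token="<unk>"):
    """Convert a caption into a list of word ids."""
    return [vocab.get(token, vocab[unk_token]) for token in tokenize(caption)]


captions = [
    "A giraffe standing up nearby a tree",
    "Two giraffes standing in a tree filled area.",
]
vocab = build_vocab(captions)
ids = encode("A giraffe standing", vocab)
```

Because the ids run from 0 to len(vocab) - 1 with no gaps, the embedding matrix needs exactly len(vocab) rows.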
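Two recommendations:
1. https://www.semanticscholar.org is a paper-browsing site. It tracks each paper's references and citations along with their popularity, which makes it very convenient for tracing the history of a topic.
2. http://md.barretlee.com/ is a code-formatting tool for WeChat posts; code formatted with it is at least somewhat readable.
Appendix 1:
Code (git): https://github.com/Wenda-Jiang/nlp_demo/blob/master/process/process.py
Code for build_mscoco_data.py: https://github.com/tensorflow/models/blob/master/research/im2txt/im2txt/data/build_mscoco_data.py
Appendix 2: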
+ data: Microsoft COCO: Common Objects in Context
(https://www.semanticscholar.org/paper/Microsoft-COCO-Common-Objects-in-Context-Lin-Maire/5e0f8c355a37a5a89351c02f174e7a5ddcb98683)
+ 2015: Show and tell: A neural image caption generator
(https://www.semanticscholar.org/paper/Show-and-tell-A-neural-image-caption-generator-Vinyals-Toshev/11da2d589485685f792a8ac79d4c2e589e5f77bd)
+ 2015: Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
(https://www.semanticscholar.org/paper/Show-Attend-and-Tell-Neural-Image-Caption-Generati-Xu-Ba/146f6f6ed688c905fb6e346ad02332efd5464616)
+ 2016: SPICE: Semantic Propositional Image Caption Evaluation
(https://www.semanticscholar.org/paper/SPICE-Semantic-Propositional-Image-Caption-Evaluat-Anderson-Fernando/4a20c069a98e13f4bb93f27a829a87a75925d0e5)
+ 2017: Knowing When to Look: Adaptive Attention via A Visual Sentinel for Image Captioning
(https://www.semanticscholar.org/paper/Knowing-When-to-Look-Adaptive-Attention-via-A-Visu-Lu-Xiong/8d42013abfcec7e7b8993106d32c84d2645c2cd9)