Spatio-Temporal Attention Models for Grounded Video Captioning
(Submitted on 17 Oct 2016 (v1), last revised 18 Oct 2016 (this version, v2))
Automatic video captioning is challenging due to the complex interactions in dynamic real scenes. A comprehensive system would ultimately localize and track the objects, actions and interactions present in a video and generate a description that relies on temporal localization in order to ground the visual concepts. However, most existing automatic video captioning systems map from raw video data to a high-level textual description, bypassing localization and recognition, and thus discard potentially valuable information for content localization and generalization. In this work, we present an automatic video captioning model that combines spatio-temporal attention and image classification by means of deep neural network structures based on long short-term memory. The resulting system is demonstrated to produce state-of-the-art results on the standard YouTube captioning benchmark while also offering the advantage of localizing the visual concepts (subjects, verbs, objects), with no grounding supervision, over space and time.
Submission history
From: Mihai Zanfir
[v1] Mon, 17 Oct 2016 08:05:12 GMT (6260kb,D)
[v2] Tue, 18 Oct 2016 08:27:23 GMT (6260kb,D)
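The abstract mentions spatio-temporal attention only at a high level. As a rough illustration of the general idea, and not the authors' formulation, the sketch below computes soft attention weights over a small grid of (time, row, column) locations and pools the features into a single context vector. The dot-product scoring, the `attend` function, and the toy feature values are all assumptions made for this example; in a captioning model the query would typically come from a recurrent decoder state.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(features, query):
    """Soft attention over space-time locations (illustrative only).

    features: dict mapping (t, y, x) -> feature vector (list of floats)
    query:    vector of the same length; a hypothetical stand-in for a
              decoder hidden state
    Returns (weights per location, attention-weighted context vector).
    """
    keys = list(features)
    # Assumed scoring function: plain dot product between feature and query.
    scores = [sum(f * q for f, q in zip(features[k], query)) for k in keys]
    weights = softmax(scores)
    dim = len(query)
    context = [
        sum(w * features[k][d] for w, k in zip(weights, keys))
        for d in range(dim)
    ]
    return dict(zip(keys, weights)), context


# Toy input: 2 frames, each with a 1x2 spatial grid of 2-d features.
features = {
    (0, 0, 0): [1.0, 0.0],
    (0, 0, 1): [0.0, 1.0],
    (1, 0, 0): [2.0, 0.0],
    (1, 0, 1): [0.0, 0.5],
}
query = [1.0, 0.0]

weights, context = attend(features, query)
# The location with the largest weight is where this query "looks".
best = max(weights, key=weights.get)
print(best, context)
```

Because the weights are indexed by both frame and spatial position, the most heavily weighted location can be read off as a coarse localization in space and time, which is the kind of grounding the abstract refers to.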