[Deep Learning Paper Notes][Video Classification] Large-scale Video Classification with Convolutional Neural Networks

Karpathy, Andrej, et al. “Large-scale video classification with convolutional neural networks.” Proceedings of the IEEE conference on Computer Vision and Pattern Recognition. 2014. (Citations: 654).


1 Spatio-Temporal CNN

We treat every video as a bag of short, fixed-size clips (15 frames in our case). Since each clip contains several temporally contiguous frames, we can extend the connectivity of the network in the time dimension to learn spatio-temporal features. There are four approaches to fusing information across the temporal domain. See Fig.


[Single-frame] Process each frame independently.
[Late Fusion] Place two separate single-frame networks with shared parameters 15 frames apart, then merge the two streams in the first fully connected layer, which can compute global motion characteristics by comparing the outputs of both networks.
[Early Fusion] Modify the conv1 filters in the single-frame network by extending them in time to size (D·T) × FH × FW.
[Slow Fusion] A balance between late fusion and early fusion, in which higher layers get access to progressively more global information in both the spatial and temporal dimensions. This is implemented by extending the connectivity of all convolutional layers in the time dimension and carrying out temporal convolutions in addition to spatial convolutions to compute activations. This model turns out to work best.
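The difference between early and slow fusion can be sketched with 3D convolutions: early fusion's first filter spans the entire clip in time, while slow fusion's filters cover only a few frames so time collapses gradually. A minimal PyTorch sketch (the filter counts, temporal kernel sizes, and strides here are illustrative assumptions, not the paper's exact hyperparameters):

```python
import torch
import torch.nn as nn

# A clip of 15 RGB frames: shape (batch, channels, time, height, width).
clip = torch.randn(1, 3, 15, 89, 89)

# Early fusion: the first conv's temporal kernel spans the whole clip
# (the paper's (D*T) x FH x FW filters), so time collapses immediately.
early_conv1 = nn.Conv3d(3, 96, kernel_size=(15, 11, 11), stride=(1, 3, 3))
print(early_conv1(clip).shape)  # temporal dimension reduced to 1

# Slow fusion: each conv layer sees only a short temporal window
# (kernel 4, stride 2 here, assumed values), so higher layers gradually
# gain access to more of the clip.
slow_conv1 = nn.Conv3d(3, 96, kernel_size=(4, 11, 11), stride=(2, 3, 3))
print(slow_conv1(clip).shape)  # temporal dimension shrinks gradually
```

Stacking further Conv3d layers on the slow-fusion path keeps reducing the temporal extent, which is what lets deeper layers integrate progressively more global motion information.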


2 Multi-resolution CNNs

We want to speed up the networks. However, simply reducing the number of layers or neurons, or training at a lower resolution, hurts performance. We propose a multi-resolution CNN composed of two separate streams. The context stream receives downsampled frames at half the original spatial resolution (89 × 89 pixels), while the fovea stream receives the center 89 × 89 region at the original resolution. In this way, the total input dimensionality is halved. Notably, this design takes advantage of the camera bias present in many online videos, since the object of interest often occupies the center region. The activations from both streams are concatenated and fed into the first fully connected layer with dense connections. See Fig.
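The two input streams can be sketched as follows in PyTorch. The full-frame size of 178 × 178 is an assumption chosen so that both halving the resolution and center-cropping yield 89 × 89 inputs; the conv towers for each stream are omitted:

```python
import torch
import torch.nn.functional as F

# A full-resolution frame (178 x 178 is an assumed original size chosen
# so that both streams end up with 89 x 89 inputs).
frame = torch.randn(1, 3, 178, 178)

# Context stream: the whole frame, downsampled to half resolution.
context = F.interpolate(frame, size=(89, 89), mode='bilinear',
                        align_corners=False)

# Fovea stream: the central 89 x 89 crop at full resolution, exploiting
# the camera bias that centers the object of interest.
c = (178 - 89) // 2
fovea = frame[:, :, c:c + 89, c:c + 89]

# Each stream would pass through its own conv tower; the resulting
# activations are flattened, concatenated, and fed to the first fully
# connected layer.
print(context.shape, fovea.shape)  # both streams see 89 x 89 inputs
```

Because both streams process 89 × 89 inputs instead of one full-resolution frame, the total input dimensionality (and thus the conv-layer cost) is roughly halved.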



3 Results
The single-frame model already displays strong performance, suggesting that local motion cues may not be critically important.

