[paper] summary (event detection)


ACMMM

ACMMM2015_EventNet: A Large Scale Structured Concept Library for Complex Event

Selected 500 events from WikiHow (95,321 videos and 4,490 concepts in total).
These concepts and events were then used to build a CONCEPT LIBRARY, organized as a tree.

ACMMM2015_Multimedia Event Detection Using Event-Driven Multiple Instance Learning

We treat each segment as an instance and model it in a multiple instance learning (MIL) framework, where each video is a "bag". The instance-event similarity (the importance of a segment to an event, obtained by matching its detected concepts against the evidential description of that event) is quantized into different levels of relatedness. Intuitively, the most (ir)relevant instances should have higher (dis)similarities. Therefore, we propose to learn the instance labels by jointly optimizing the instance classifier and its relatedness level.
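To make the bag/instance setup concrete, here is a minimal sketch in Python, assuming segment-level features and a trained instance classifier are already available; the bag score below is a simple top-k surrogate for "the most relevant segments drive the decision", not the paper's joint optimization of instance labels and relatedness levels.

```python
import numpy as np

def bag_score(instance_features, instance_classifier, top_k=3):
    """Score a video ("bag") from its segment ("instance") scores.

    MIL-style sketch: the bag score is the mean of the top-k instance
    scores. The paper instead learns instance labels jointly with the
    classifier; this only shows the inference-side intuition.
    """
    scores = instance_classifier(instance_features)   # shape: (n_segments,)
    top = np.sort(scores)[-top_k:]
    return float(top.mean())

# toy usage with a hypothetical linear instance classifier
w = np.random.randn(128)
clf = lambda X: X @ w
video_segments = np.random.randn(20, 128)             # 20 segments, 128-d features
print(bag_score(video_segments, clf))
```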

ACMMM2015_Searching Persuasively: Joint Event Detection and Evidence Recounting

CVPR

CVPR2013_Complex Event Detection via Multi-Source Video Attributes

The paper argues that attributes at the image level are not enough, and proposes attributes at the video level (called video attributes): the semantic labels of other external videos collected by researchers. Note that these external videos are different from complex event videos. The external videos contain simple content of people, objects, scenes and actions, which are the basic elements of complex events.
As the external videos are used by treating their semantic labels as video attributes, these videos are called attribute videos.

how to use video attributes?

we propose to use video attributes as additional information to assist complex event detection. Specifically, our framework learns the attribute classifier and event detector simultaneously. The observation of a particular event affects the attribute classifier, and in return, attributes characterize the event. This kind of mutual influence is explored by a correlation vector, which helps incorporate extra informative cues into the event detector.

CVPR2013_Event Recognition in Videos by Learning From Heterogeneous Web Sources

The goal is to learn a robust target classifier by using the loosely labeled single-view data from the heterogeneous source domains[web videos (e.g., from YouTube) and web images (e.g., from Google/Bing image search)] and the unlabeled multi-view data from the target domain.
Observing that some source domains are more relevant to the target domain, in Section 3, we propose a new method called Multi-domain Adaptation with Heterogeneous Sources (MDA-HS) to effectively cope with heterogeneous sources. Specifically, we seek the optimal weights for different source domains with different types of features and also infer the labels of unlabeled target domain data based on all types of features. For each source domain, we propose to learn an adapted classifier based on the pre-learnt source classifier with data distribution mismatch, for which we minimize the distance between the two classifiers in terms of their weight vectors. We introduce a new regularizer by summing the weighted distances from all the source domains and combine all the weighted adapted classifiers as a new target classifier. We also propose a new ρ-SVM based objective function by using the new regularizer and target classifier for domain adaptation. We develop an iterative optimization method by using the cutting plane method and solving a group-based multiple kernel learning (MKL) problem.

CVPR2013_Event retrieval in large video collections with circulant temporal encoding

what is event retrieval?

Given a video clip of a specific event, e.g., the wedding of Prince William and Kate Middleton, the goal is to retrieve other videos representing the same event from a dataset of over 100k videos.

methods

Our approach encodes the frame descriptors of a video to jointly represent their appearance and temporal order. It exploits the properties of circulant matrices to compare the videos in the frequency domain.
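A minimal sketch of the frequency-domain comparison, assuming both videos are represented by equal-length sequences of frame descriptors; it only illustrates circular cross-correlation via FFT, not the paper's full regularized encoding or indexing scheme.

```python
import numpy as np

def circulant_similarity(A, B):
    """Compare two videos via circular cross-correlation in the frequency domain.

    A, B: (T, d) arrays of per-frame descriptors padded/truncated to the same
    length T. For each temporal shift, the score is the sum of frame-wise dot
    products; thanks to the circulant structure, all shifts are computed at
    once with FFTs. A simplified sketch of the idea only.
    """
    FA = np.fft.rfft(A, axis=0)
    FB = np.fft.rfft(B, axis=0)
    # cross-correlation per descriptor dimension, summed over dimensions
    scores = np.fft.irfft(np.conj(FA) * FB, n=A.shape[0], axis=0).sum(axis=1)
    return scores.max(), int(scores.argmax())   # best score and the aligning shift

a = np.random.randn(64, 32)                      # 64 frames, 32-d descriptors
b = np.roll(a, 10, axis=0) + 0.01 * np.random.randn(64, 32)
print(circulant_similarity(a, b))                # should recover a shift near 10
```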

CVPR2014_DISCOVER: Discovering Important Segments for Classification of Video Events

what does it do?

Given a query video, our framework provides not only a high level event label (e.g. a wedding ceremony), but also video segments which are important positive evidence and their textual descriptions (e.g. people hugging).

framework

The framework segments videos into short clips and applies pre-trained primitive action classifiers to each clip. The presence of an event is determined by: (1) a global video representation, generated by pooling visual features over the entire video, and (2) the presence of several pieces of different positive evidence which are consistent over time.
Our model consists of a global template, a set of local evidence templates and a temporal transition constraint of evidence set. Given a video, we find the sequence of video segments which achieve best overall score in matching the evidence templates and meeting the temporal constraints. An event label is assigned based on the global feature of a video, as well as features from the selected pieces of evidence.
Roughly: the video is first segmented into clips, then N action classifiers score each clip, so each clip ends up with N scores (one per action); the order, features, and so on of these actions are then modeled.

CVPR2014_Event Detection using Multi-Level Relevance Labels and Multiple Features

Due to the complex nature of an event, it is comparatively hard to find positive exemplars which exactly match the definition of the event. However, it is easier to find videos that match the definition partially, which are referred to as related exemplars in this paper.
For examples:
1. “Man performs an oil change on a motorcycle” is marked as a related exemplar to the event “Changing a vehicle tire”.
2. A video depicted as “A dog lies in the grass” is considered as a related exemplar to the event “Grooming an animal”.
Related videos can be treated neither as positive nor as negative. This paper proposes to adaptively learn the relevance level of each related video and leverage the related videos of high relevance to infer a robust detector. Instead of binary labels (+1, -1), it uses ordinal labels to differentiate the relevance levels of related videos.
Specifically, if we use a total of R (R ≥ 3) ordinal labels to denote the R relevance levels, we assign 1 as the negative label and R as the positive label. The numbers between 1 and R correspond to related videos. A larger ordinal label indicates a higher relevance level: ordinal labels close to 1 indicate low relevance to the event, and ordinal labels close to R indicate high relevance.

framework

To progress beyond the state of the art, we propose a cross-feature reasoning approach to generate a set of candidate labels for all related videos and then adaptively select an optimal ordinal label for each of them. After assigning one candidate label to each video, we enumerate possible combinations of all the related videos. We then learn an optimal weight for each label combination. In conjunction with a kernel matrix, each label combination can be used to train a model for event detection. Given multiple label combinations, we have multiple models. We then formulate the label weighting problem in a multiple kernel learning fashion to obtain a unified event detector, where a maximum margin criterion is applied to learn (R − 1) discriminative boundaries between each pair of consecutive ordinal labels. To make the results more robust, we propose to recursively update the label combinations. Once we get the unified event detector, we use it to predict the labels of related videos and update the label combinations, which are then used for another round of learning. The procedure is repeated until convergence, and the final unified detector is used for event detection.

CVPR2014_Recognition of Complex Events: Exploiting Temporal Dynamics between Underlying Concepts

In our approach, a video is decomposed into a sequence of overlapping fixed-length temporal clips, on which low-level feature detectors are applied. Each clip is then represented as a histogram (bag-of-visual-words) which is used as a clip level feature and tested against a set of pre-trained action concept detectors. Real-valued confidence scores, pertaining to the presence of each concept are recorded for each clip, converting the video into a vector time series.

video --> fixed-length clips; clip --> feature --(action concept detectors)--> a vector time series (confidence scores from each concept detector)
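A small sketch of this shared preprocessing step, with a placeholder feature extractor and placeholder detectors; it just produces the clip-by-concept confidence matrix described above.

```python
import numpy as np

def concept_time_series(video_frames, clip_len, feature_fn, concept_detectors):
    """Turn a video into a (num_clips x num_concepts) confidence matrix.

    Fixed-length clips -> clip feature -> one score per concept detector.
    `feature_fn` and `concept_detectors` stand in for whatever feature
    extractor and pre-trained detectors are actually used.
    """
    clips = [video_frames[t:t + clip_len]
             for t in range(0, len(video_frames) - clip_len + 1, clip_len)]
    rows = []
    for clip in clips:
        f = feature_fn(clip)                                  # clip-level feature
        rows.append([det(f) for det in concept_detectors])    # one score per concept
    return np.array(rows)                                     # shape: (T, K)

# toy usage with random "frames", mean-pooling features and linear detectors
frames = np.random.randn(200, 64)
detectors = [lambda f, w=np.random.randn(64): float(f @ w) for _ in range(5)]
ts = concept_time_series(frames, clip_len=20, feature_fn=lambda c: c.mean(0),
                         concept_detectors=detectors)
print(ts.shape)   # (10, 5)
```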

CVPR2014_Temporal Sequence Modeling for Video Event Detection

our goal is to break a visual sequence into segments of varied lengths and label them with events of interest or a null event.
The sequence model is built upon visual words (sub-events), not on annotated events, thus it does not require ground truth.
we first represent a video by a sequence of visual words learnt from our data in an unsupervised way with k-means clustering. We then apply the Sequence Memoizer (SM) [21] to explore temporal dependencies among the visual words in the sequence.
The SM-based sequence model is empowered with the ability to predict the occurrence of a subsequent visual word in a sequence, conditioned on all previously observed context. We finally integrate the sequence model and event classification into a framework that performs segmentation and classification of events jointly in a video.

CVPR2014_Video Event Detection by Inferring Temporal Instance Labels

Multi-instance learning: a video consists of instances.
Our key assumption is that the positive videos usually have a large portion of positive instances, while the negative videos have few positive instances.

CVPR2015_A Discriminative CNN Video Representation for Event Detection

Mainly built on VGG16.
Frame features are extracted and then encoded into a video-level feature.
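A minimal sketch of the simplest frame-to-video pooling (mean + L2 normalization). The paper's actual encoding of the CNN descriptors is more elaborate, so treat this only as the baseline step it builds on.

```python
import numpy as np

def video_feature_from_frames(frame_features, normalize=True):
    """Average-pool per-frame CNN features into one video-level feature.

    Simplest pooling baseline (mean + L2 norm), not the paper's encoding.
    """
    v = np.asarray(frame_features).mean(axis=0)
    if normalize:
        v = v / (np.linalg.norm(v) + 1e-12)
    return v

frames = np.random.randn(300, 4096)               # e.g. 300 frames of VGG16 fc features
print(video_feature_from_frames(frames).shape)    # (4096,)
```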

CVPR2015_Beyond Gaussian Pyramid: Multi-Skip Feature Stacking for Action Recognition

Feature: IDT + FV.
Framework (roughly): extract features with multiple time skips and stack the resulting encodings.
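A rough sketch of the multi-skip idea, under the assumption that "time skips" correspond to frame-sampling strides; `encode_fn` is a placeholder standing in for the IDT+FV pipeline.

```python
import numpy as np

def multi_skip_features(frames, encode_fn, skips=(1, 2, 4, 8)):
    """Extract one encoding per temporal skip and stack them.

    Re-sample the frame sequence with several strides, encode each
    re-sampled sequence, and concatenate the encodings.
    """
    parts = [encode_fn(frames[::s]) for s in skips]
    return np.concatenate(parts)

frames = np.random.randn(240, 64)                 # toy per-frame descriptors
encode = lambda seq: seq.mean(axis=0)             # placeholder encoder
print(multi_skip_features(frames, encode).shape)  # (256,) = 4 skips x 64 dims
```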

CVPR2015_DevNet: A Deep Event Network for Multimedia Event Detection

CNN model: similar to "ImageNet Classification with Deep Convolutional Neural Networks" (2012).
Details: pretrained on ImageNet 2014, fine-tuned on MED.
With slight modifications to the CNN structure, it can be trained on videos.
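A minimal fine-tuning sketch with torchvision, using AlexNet as a stand-in for the paper's modified CNN; `num_event_classes` is a placeholder and none of DevNet's actual architectural changes are reproduced here.

```python
import torch
import torch.nn as nn
from torchvision import models

# "Pretrain on ImageNet, fine-tune on event data" recipe, minimal version.
num_event_classes = 20   # placeholder number of event classes

model = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
model.classifier[6] = nn.Linear(4096, num_event_classes)   # replace the last layer

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
criterion = nn.CrossEntropyLoss()

frames = torch.randn(8, 3, 224, 224)        # a toy batch of video frames
labels = torch.randint(0, num_event_classes, (8,))
loss = criterion(model(frames), labels)
loss.backward()
optimizer.step()
```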

Video Event Recognition with Deep Hierarchical Context Model

Surveillance videos.
Hierarchical model.
To be continued.

CVPR2015_Visual Recognition by Counting Instances: A Multi-Instance Cardinality Potential Kernel

main:

  1. To determine whether an event is present in a long internet video, one could count how many frames seem to contain the activity.
  2. We encode our intuition about these counting relations in a multiple instance learning framework. Our approach models a bag of instances with a probabilistic model with a cardinality-based clique potential between the instance labels. This cardinality potential facilitates defining any cardinality relations between the instance labels, and allows efficient and exact maximum a posteriori (MAP) and sum-product inference.

Proposed Method: Cardinality Kernel

We propose a novel kernel for modeling cardinality relations, counting instance labels in a bag – for example the number of people in a scene who are performing an action.
Each video is a bag comprised of individual frames. The goal is to label a video according to whether a high-level event of interest is occurring in the video or not.
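A crude counting stand-in for the cardinality idea, assuming per-frame classifier scores are available; the thresholds are illustrative only, and the paper's clique-potential inference is not reproduced.

```python
import numpy as np

def bag_label_by_count(instance_scores, threshold=0.0, min_count=5):
    """Decide a video's (bag's) event label by counting positive-looking frames.

    An instance (frame) is counted as positive if its score exceeds
    `threshold`, and the bag is positive if at least `min_count` instances
    are positive. The paper models this with a cardinality clique potential
    and exact MAP / sum-product inference; this is only the intuition.
    """
    count = int((np.asarray(instance_scores) > threshold).sum())
    return count >= min_count, count

frame_scores = np.random.randn(120)          # per-frame classifier scores
print(bag_label_by_count(frame_scores))
```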

CVPR2015_Visual Recognition by Learning From Web Data: A Weakly Supervised Domain Generalization Approach

domain

  1. we refer to the training dataset and testing dataset as the source domain and target domain, respectively. When target domain data is unavailable during the training process, the domain adaptation problem becomes another related task called domain generalization, which aims to learn robust classifiers that can generalize well to any unseen target domain.
  2. The dataset of photos/videos from each user can be considered one target domain, because different users may use different cameras and capture their photos/videos in their own ways. So we have a large number of target domains from various users, and meanwhile some users may not be willing to share their photos/videos with others as target domain data due to privacy concerns.

main

In this work, we formulate a new weakly supervised domain generalization approach for visual recognition by using loosely labeled web images/videos as training data.
1. coping with noise in the labels of training web images/videos in the source domain;
We formulate a multi-instance learning (MIL) problem by selecting a subset of training samples from each training bag and simultaneously learning the optimal classifiers based on the selected samples.
2. enhancing generalization capability of learnt classifiers to any unseen target domain.
We assume the training web images/videos may come from multiple hidden domains with different data distributions. We aim to learn one classifier for each class and each latent domain. As each classifier is learnt from training samples with a distinctive data distribution, the integrated classifier, obtained by combining the multiple classifiers for each class, is expected to be robust to variations in data distribution, and thus generalizes well to test data from any unseen target domain.

CVPR2016_They Are Not Equally Reliable: Semantic Event Search Using Differentiated Concept Classifiers

[TRECVID 0Ex: only a textual description of the event is given, without any training videos.]
Our system is built on the observation that an event is a composition of multiple mid-level concepts. These concepts are shared among events and can be collected from other sources (not necessarily related to the event search task). We then train a skip-gram language model to automatically identify the most relevant concepts to a particular event of interest.
Not all concept classifiers are equally reliable, especially when they are trained on other source domains. A relevant concept can be of limited use, or even misused, if its classifier is highly unreliable. Therefore, when combining concept scores, we propose to take their relevance, predictive power, and reliability all into account.
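A hedged sketch of embedding-based concept relevance, using a publicly available pre-trained embedding loaded via gensim instead of the paper's own skip-gram model; the event description and concept names below are made up for illustration.

```python
import numpy as np
import gensim.downloader as api

# Load a pre-trained word embedding as a stand-in for the paper's skip-gram model.
wv = api.load("glove-wiki-gigaword-100")

event_description = "birthday party with cake candles and singing".split()
concepts = ["cake", "singing", "dog", "skateboard"]

relevance = {}
for c in concepts:
    # cosine similarity between the concept word and the event description words
    relevance[c] = float(wv.n_similarity([c], event_description))

# normalize into [0, 1] to play the role of the relevance vector w
vals = np.array(list(relevance.values()))
w = (vals - vals.min()) / (vals.max() - vals.min() + 1e-12)
print(dict(zip(relevance, w.round(2))))
```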

concept detector

Since concepts are shared among many different classes (events) and each concept classifier can be trained independently on datasets from other sources, semantic event search can be achieved by combining the relevant concept classification scores.

Semantic concept relevance

learning a relevance score between the event description and the pre-trained concept (attribute) classifiers.
Output: a relevance vector w ∈ [0,1]^m, where w_k measures the a priori relevance between the k-th concept and the event of interest.

Concept pruning and refining

We further prune and refine these weights(relevance scores) for the following reasons:
1). Some concepts, although relevant to the event of interest, may not be very discriminative (low predictive power). 2). Some concepts may not be very reliable, possibly because they are trained on different domains.
how to:
we use the (unlabeled) MED 2014 Research dataset to crudely refine the concepts as follows: We first compute a similarity score between the concept names and the text description of each video in the research dataset, which acts as a concept label, i.e. the likelihood of each video to contain a particular concept. Then we run the concept classifiers on each video in the research dataset, and use the aforementioned concept labels to compute average precisions. Concepts with low precision or low predictive power (such as the concept people) are then dropped.
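A sketch of this pruning step, assuming a matrix of text-based similarities (used as pseudo labels) and a matrix of detector scores on the research set are already available; the thresholds are illustrative, not the paper's values.

```python
import numpy as np
from sklearn.metrics import average_precision_score

def prune_concepts(text_similarity, detector_scores, concept_names,
                   label_thresh=0.5, ap_thresh=0.3):
    """Drop concepts whose detectors look unreliable on an unlabeled dataset.

    text_similarity[i, k]: text-based likelihood that video i contains
    concept k (used as a pseudo label); detector_scores[i, k]: the detector's
    score on that video. Concepts with low average precision are removed.
    """
    keep = []
    for k, name in enumerate(concept_names):
        pseudo_labels = (text_similarity[:, k] > label_thresh).astype(int)
        if pseudo_labels.sum() in (0, len(pseudo_labels)):
            continue                                   # AP undefined, drop
        ap = average_precision_score(pseudo_labels, detector_scores[:, k])
        if ap >= ap_thresh:
            keep.append(name)
    return keep

sims = np.random.rand(100, 4)
scores = np.random.rand(100, 4)
print(prune_concepts(sims, scores, ["cake", "singing", "dog", "people"]))
```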

Combine the classifier ensemble

Suppose for event e we have selected m concepts (different events may use different concepts), each with a weight w_i ∈ [0,1], i = 1, ..., m. Then, for any test video v, the i-th concept classifier generates a confidence score s_i(v) ∈ [−1, 1]. Since different concept classifiers result in different confidence scores, we need a principled way to combine them, preferably also taking their relevance w into account.

s(v) = Σ_{i=1}^{m} w_i · s_i(v)

CVPR2017_ER3: A Unified Framework for Event Retrieval, Recognition and Recounting

some definitions

Previous research was pursued in both unsupervised and supervised settings. Unsupervised models were typically used for event retrieval, where the goal is to retrieve all the related videos in the database that are in some sense similar to the query video provided by a user. On the other hand, supervised learning has been used in event recognition or detection in similar ways as in action recognition and general video classification. In this latter case, a classifier is learned from annotated training videos to detect and recognize the event categories of the test videos, e.g., the multimedia event detection task of TRECVID. In practical applications, it is often important to qualify the event category prediction by providing an explanation for it. In particular, the system needs to localize the key pieces of evidence that lead to the recognition decision. This is sometimes referred to as event recounting.

framework

In ER3, (i) we introduce a feature alignment step which can significantly suppress the redundant information and generate a more comprehensive and compact video representation called video imprint. In addition, the video imprint also preserves the local spatial layout among video frames. (ii) Based on the video imprint, we further employ a reasoning network, a modified version of the neural memory network, which can simultaneously recognize the event category and locate the key pieces of evidence for the event category.

Feature alignment

  1. Why do we need “feature alignment”?
    Typically one just averages the frame features to obtain the video feature. The paper gives an example of why this is problematic: since such frames share similar content, simply averaging frame descriptors may lead to over-emphasis of these descriptors and reduce the discriminative power of the video representation.
  2. The idea of feature alignment comes from panoramic stitching
    panoramic stitching: can stitch images into a full view panorama, removing the overlap among the input images. If we could generate an equivalent panoramic representation from video frames of an event, the redundancy across frames would be removed and the video representation would be less sensitive to the frequency of the repetition of the less discriminating features.

Unified Embedding and Metric Learning for Zero-Exemplar Event Detection

Fast Video Classification via Adaptive Cascading of Deep Models

ECCV

ECCV2012_Complex Events Detection using Data-driven Concepts

Learns a video representation from data-driven concepts.

ECCV2012_Local Expert Forest of Score Fusion for Video Event Classification

A bank of base classifiers follows, each of which is trained to produce a likelihood score based on a subset of the features. Their outputs for a particular event are then fused by our method, and the resulting fused likelihood is used to rank the clips in the archive relative to the operator's interest.
In short: extract various features and train many classifiers; at test time each classifier produces a score, and these scores are fused.
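A minimal late-fusion sketch: each base classifier yields one score and the fused score is a weighted average. The paper's local expert forest learns these fusion weights locally in score space, which is not shown here.

```python
import numpy as np

def fuse_scores(scores_per_feature, weights=None):
    """Late fusion of per-feature classifier scores into one event score.

    Each base classifier (trained on a subset of the features) produces a
    likelihood score for the clip; the fused score is their weighted average.
    """
    s = np.asarray(scores_per_feature, dtype=float)
    w = np.ones_like(s) / len(s) if weights is None else np.asarray(weights, float)
    return float(np.dot(w, s) / w.sum())

print(fuse_scores([0.9, 0.4, 0.7], weights=[0.5, 0.2, 0.3]))
```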

ECCV2012_Multi-channel Shape-Flow Kernel Descriptors for Robust Video Event Detection and Retrieval

ECCV2014_Sequential Max-Margin Event Detectors

Trains a multi-class classifier (rather than the usual binary, one-versus-all setup)?
SMMED is a maximum margin classifier learned using partial segments of training events. Unlike existing approaches, SMMED can sequentially select the most likely subset of classes while automatically enforcing a larger margin for the unlikely classes. As a result, SMMED can reliably discard many classes using only partially observed events.
Fig. 1. Given a test event (sequence of a subject playing the violin in the top of the figure), SMMED sequentially evaluates partial events at {10%, 20%, · · · , 100%} . When SMMED is confident that the event is not from a given class, it automatically discards this class from further consideration. The blue bars illustrate that class #2(IceDancing), #4(BlowDryHair),#5(Blending), #3(Shaving) are sequentially discarded. Finally, the test event is identified as class #1(playing the violin): the remaining class (the longest blue bar), after 80% of the event has been evaluated.
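A rough illustration of the sequential elimination behaviour (inference only, not SMMED's max-margin training); the per-class scorers and the margin below are hypothetical.

```python
import numpy as np

def sequential_eliminate(event_features, class_scorers, margin=1.0):
    """Sequentially discard unlikely classes while observing more of an event.

    At each observed fraction of the event, every remaining class is scored
    on the partial segment, and classes trailing the current best score by
    more than `margin` are discarded.
    """
    remaining = dict(class_scorers)
    T = len(event_features)
    for frac in np.arange(0.1, 1.01, 0.1):
        partial = event_features[: max(1, int(frac * T))]
        scores = {c: scorer(partial) for c, scorer in remaining.items()}
        best = max(scores.values())
        remaining = {c: remaining[c] for c, s in scores.items() if s >= best - margin}
        if len(remaining) == 1:
            break
    return list(remaining), float(frac)

feats = np.random.randn(100, 16)
scorers = {f"class_{i}": (lambda X, w=np.random.randn(16): float(X.mean(0) @ w))
           for i in range(5)}
print(sequential_eliminate(feats, scorers))
```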

ICCV

ICCV2013_ACTIVE: Activity Concept Transitions in Video Event Classification


Figure 1. Illustration of our approach. (Top) A video from the Wedding ceremony event is separated into a sequence of clips, each of which corresponds to an activity concept like kissing and dancing. (Bottom) Each dimension in our representation corresponds to an activity concept transition. A positive value indicates the transition is more likely to happen than expected under the Hidden Markov Model.

Our paper tries to address this problem by exploiting activity concept transitions in video events (ACTIVE). A video is treated as a sequence of short clips, all of which are observations corresponding to latent activity concept variables in a Hidden Markov Model (HMM). We propose to apply Fisher Kernel techniques so that the concept transitions over time can be encoded into a compact and fixed length feature vector very efficiently.

The central idea is activity concept transitions.
1. to encode activity concept transitions with Fisher kernel techniques.
2. We use a Hidden Markov Model (HMM) as the underlying generative model. In this model, a video event is a sequence of activity concepts (the concepts presumably correspond to the hidden states). A new concept is generated with certain probabilities based on the previous concept. An observation is a low-level feature vector from a sub-clip, generated based on the concepts.
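A crude counting analogue of the transition idea, assuming a T×K matrix of per-clip concept scores is available; the paper encodes transitions through the HMM Fisher kernel rather than raw counts, so this is only the intuition.

```python
import numpy as np

def transition_histogram(clip_concept_scores):
    """Encode a video by its activity-concept transitions.

    clip_concept_scores: (T, K) matrix of per-clip concept confidences.
    Take the top concept of each clip, count the K x K transitions between
    consecutive clips, and flatten the normalized counts into a fixed-length
    vector. A crude counting analogue of the HMM Fisher-kernel encoding.
    """
    S = np.asarray(clip_concept_scores)
    K = S.shape[1]
    labels = S.argmax(axis=1)
    hist = np.zeros((K, K))
    for a, b in zip(labels[:-1], labels[1:]):
        hist[a, b] += 1
    total = hist.sum()
    return (hist / total).ravel() if total > 0 else hist.ravel()

scores = np.random.rand(12, 5)               # 12 clips, 5 activity concepts
print(transition_histogram(scores).shape)    # (25,)
```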

Video Representation

K concept detectors are trained.
Each video is split into T segments, and each segment is scored by the K detectors, yielding a T×K matrix.

HMM Fisher Vector


1. We use HMM to model a video event with activity concept transitions over time.
2. The idea of the Fisher kernel was first proposed in [7]; the goal is to obtain the sufficient statistics of a generative model and use them as kernel functions in discriminative classifiers such as SVM. (???) Is the FV used to obtain the video representation, with an SVM then trained on top?

TO BE CONTINUED