P-CNN: Pose-based CNN Features for Action Recognition
ICCV 2015
GitHub: https://github.com/gcheron/P-CNN
Project: http://www.di.ens.fr/willow/research/p-cnn/
Original post: http://blog.csdn.net/zimenglan_sysu/article/details/49802769
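- First, a state-of-the-art pose estimator extracts the pose in every frame of the video.
- Define the parts.
- These are, for example, the upper body and full body shown in the figure. The pose coordinates are used to crop a patch for each part. Patches are taken from both the original RGB frame and the motion image, where the motion image is the optical flow.
- Classic pretrained CNN models extract fc-layer features from each patch (e.g. the 4096-d fc7 features).
- An aggregation step condenses the per-frame features so that the P-CNN descriptor of a video has a fixed dimension. The aggregation schemes include max, min, mean, and max/min. Interestingly, the experimental results show that motion contributes far more than RGB.
- Train SVM action classifiers.
- Video feature descriptors are generally high-dimensional, e.g. 160k-d for P-CNN, so the features need dimensionality reduction before the SVM is trained; PCA or a similar method can be used.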
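The 160k-d figure and the PCA step can be sketched as follows. This is a minimal illustration, not the authors' code: the breakdown of the descriptor size (5 parts, 2 streams, 4 aggregated vectors of 4096 dimensions) is an assumption that happens to match the 160k-d figure, and the toy matrix `X` and the target dimension `k` are hypothetical.

```matlab
% Assumed breakdown of the descriptor size for (static+dyn)(max/min-aggr).
n_parts   = 5;     % left hand, right hand, upper body, full body, full image
n_streams = 2;     % appearance (RGB) and flow
n_aggr    = 4;     % min and max, each for static and dynamic features
fc_dim    = 4096;  % fc7 feature size
pcnn_dim  = n_parts * n_streams * n_aggr * fc_dim;
fprintf('P-CNN descriptor dimension: %d\n', pcnn_dim);

% Toy PCA via SVD: rows are videos, columns are feature dimensions.
X  = sin((1:20)' * (1:10));        % 20 hypothetical videos, 10-d features
mu = mean(X, 1);                   % per-dimension mean
Xc = bsxfun(@minus, X, mu);        % center the data
[~, ~, V] = svd(Xc, 'econ');       % columns of V are the principal directions
k  = 3;                            % hypothetical target dimension
Z  = Xc * V(:, 1:k);               % reduced features, 20 x k
```

When this is applied to real data, `mu` and `V` would be estimated on the training videos only and then reused to project the test videos.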
Some implementation details:
Crop Patches
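Both the RGB image and the flow image are split, based on the localized joints, into five patches: left hand (lefthand), right hand (righthand), upper body (upperbody), full body (fullbody), and full image (fullimage). Each patch is resized to 224x224 to match the CNN input.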
extract_cnn_patches(video_names,param)
    % get part boxes
    % part CNN (fill missing part before resizing)
    % scale
    sc=scale(idim);
    % half side length of the box; param.lside is the configured length:
    % 40 for JHMDB, 120 for MPII Cooking
    lside=param.lside*sc ;

    % left hand
    % get_box_and_fill: given boxes positions and image, return the corresponding box with pixels
    % out of the image filled with gray
    lhand = get_box_and_fill(positions(:,param.lhandposition,idim)-lside,positions(:,param.lhandposition,idim)+lside,im);
    lhand = imresize(lhand, net.normalization.imageSize(1:2)) ;

    % right hand
    rhand = get_box_and_fill(positions(:,param.rhandposition,idim)-lside,positions(:,param.rhandposition,idim)+lside,im);
    rhand = imresize(rhand, net.normalization.imageSize(1:2)) ;

    % upper body
    sc=scale(idim); lside=3/4*param.lside*sc ;
    upbody = get_box_and_fill(min(positions(:,param.upbodypositions,idim),[],2)-lside,max(positions(:,param.upbodypositions,idim),[],2)+lside,im);
    upbody = imresize(upbody, net.normalization.imageSize(1:2)) ;

    % full body
    fullbody = get_box_and_fill(min(positions(:,:,idim),[],2)-lside,max(positions(:,:,idim),[],2)+lside,im);
    fullbody = imresize(fullbody, net.normalization.imageSize(1:2)) ;

    % full image CNNf (just resize frame)
    fullim = imresize(im, net.normalization.imageSize(1:2)) ;
get_box_and_fill(topleft,botright,im)
    function box=get_box_and_fill(topleft,botright,im)
    % given box positions and image, return the corresponding box with pixels
    % out of the image filled with gray
    % For the left and right hands:
    %   topleft:  positions(:,param.lhandposition,idim)-lside
    %   botright: positions(:,param.lhandposition,idim)+lside
    % For the upper body and full body:
    %   topleft:  min(positions(:,param.upbodypositions,idim),[],2)-lside
    %   botright: max(positions(:,param.upbodypositions,idim),[],2)+lside
    [h,w,~]=size(im);

    % round to integer pixel coordinates
    topleft=round(topleft);
    botright=round(botright);

    % 3-D array in which every value is 128 (gray)
    box=uint8(128*ones(botright(2)-topleft(2)+1,botright(1)-topleft(1)+1,3));

    % check if a part of the box is in the image
    if topleft(1) > w || topleft(2) > h || botright(1) < 1 || botright(2) < 1
        return % return a gray box
    end

    left_min = max(topleft(1),1) ;
    top_min = max(topleft(2),1) ;
    right_max = min(botright(1),w) ;
    bot_max = min(botright(2),h) ;

    im_w=left_min:right_max ;
    im_h=top_min:bot_max ;

    box(top_min-topleft(2)+1:top_min-topleft(2)+length(im_h),left_min-topleft(1)+1:left_min-topleft(1)+length(im_w),:)=im(im_h,im_w,:);
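To see what the gray filling does, the same clipping logic can be run inline on a tiny synthetic image. This is a sketch under assumptions: the 5x6 image, the joint position `center`, and `lside = 1` are made up for illustration, and the early return for boxes that lie completely outside the image is left out because this box overlaps the image.

```matlab
% Toy 5x6 RGB image whose pixel values are 1..90 (all fit in uint8).
im = uint8(reshape(1:5*6*3, 5, 6, 3));
[h, w, ~] = size(im);

center = [6; 1];                    % hypothetical joint (x; y) at the top-right corner
lside  = 1;                         % hypothetical half side length
topleft  = round(center - lside);   % [5; 0], one row above the image
botright = round(center + lside);   % [7; 2], one column right of the image

% Start from a gray box, then copy in the part that overlaps the image.
box  = uint8(128 * ones(botright(2)-topleft(2)+1, botright(1)-topleft(1)+1, 3));
cols = max(topleft(1), 1):min(botright(1), w);   % image columns 5:6
rows = max(topleft(2), 1):min(botright(2), h);   % image rows 1:2
box(rows - topleft(2) + 1, cols - topleft(1) + 1, :) = im(rows, cols, :);

disp(box(:, :, 1))   % first row and last column stay gray (128)
```

The resulting 3x3 patch keeps the joint centered even though part of the box falls outside the frame, which is why the padding is done before the resize to the CNN input size.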
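Extracting CNN Features
- Two different models are used, one for RGB and one for flow:
- RGB: VGG-f trained on ImageNet
- flow: flow_net (pretrained on UCF101; Finding action tubes, CVPR 2015)
- Both models have 5 conv layers + 3 fc layers, and both produce 4096-d features.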
extract_cnn_features(video_names,param)
    % extract CNN features per frame
    for b=1:bsize:nim
        fprintf('%s -- feature extraction: %d\tover %d:\t',suf{i},b,nim);tic;
        im = vl_imreadjpeg(filelist(b:min(b+bsize-1,nim)),'numThreads', param.nbthreads_netinput_loading) ;
        im = cat(4,im{:}) ;
        im = bsxfun(@minus, im, net.normalization.averageImage) ;
        if param.use_gpu ; im = gpuArray(im) ; end
        res=vl_simplenn(net,im);
        fprintf('extract %.2f s\t',toc);tic;
        save_feats(squeeze(res(end-2).x),outlist(b:min(b+bsize-1,nim)),param); % take features after last ReLU
        fprintf('save %.2f s\n',toc)
    end
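The batching and mean subtraction in this loop can be tried without MatConvNet. The sketch below only mirrors the index arithmetic `b:min(b+bsize-1,nim)` and the `bsxfun` call; the values of `nim` and `bsize`, the constant toy batch, and the per-channel mean `avg` are assumptions made for illustration.

```matlab
% Split nim images into batches of at most bsize, as in the loop above.
nim = 10; bsize = 4;                 % hypothetical sizes
batches = {};
for b = 1:bsize:nim
    batches{end+1} = b:min(b+bsize-1, nim);   % the last batch may be shorter
end

% Mean subtraction on a 4-D batch laid out as H x W x C x N.
im  = ones(2, 2, 3, numel(batches{1}), 'single') * 100;   % toy batch, all pixels 100
avg = single(reshape([10 20 30], 1, 1, 3));                % hypothetical per-channel mean
im_centered = bsxfun(@minus, im, avg);   % avg is expanded over H, W and N
```

Here the batches are 1:4, 5:8 and 9:10, and every pixel of the second channel becomes 80 after centering.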
Computing P-CNN Features
The authors' P-CNN features use different aggregation schemes. The paper proposes the following:
- max-aggr
- max/min-aggr
- (static+dyn)(max-aggr)
- (static+dyn)(max/min-aggr)
- mean-aggr
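max and min take, for each part, the maximum or minimum of every dimension of the per-frame CNN features over all frames; the results are then concatenated.
static: the concatenation of the min and max values.
dyn: differences between CNN features taken 4 frames apart, aggregated with the same min and max.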
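A minimal sketch of the (static+dyn)(max/min-aggr) scheme for a single part, under stated assumptions: the toy feature matrix `F` stands in for real 4096-d fc7 features, and the concatenation order (min before max, static before dyn) is chosen here for illustration and may differ from the released code.

```matlab
% Toy per-frame features of one part: T frames x D dimensions.
T = 10; D = 3;                       % real features would have D = 4096
F = sin((1:T)' * (1:D));
dt = 4;                              % frame offset for the dynamic features

% Static: per-dimension min and max over all frames.
static = [min(F, [], 1), max(F, [], 1)];          % 1 x 2D

% Dynamic: differences between features dt frames apart,
% aggregated with the same min and max.
dF  = F(1+dt:end, :) - F(1:end-dt, :);            % (T-dt) x D
dyn = [min(dF, [], 1), max(dF, [], 1)];           % 1 x 2D

desc = [static, dyn];                             % 1 x 4D, independent of T
```

Because the aggregation runs over the frame axis, `desc` has the same length for any number of frames; the descriptors of all parts and both streams would then be concatenated into the video descriptor.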
compute_pcnn_features(param)
    disp('In appearance') % process the RGB frames and compute the per-part norms (code below; it is just the L2 norm)
    if isfield(param,'perpartL2') && param.perpartL2
        fprintf('Compute per part norms ---> '); tic;
        norms=get_partnorms(param.trainsplitpath,featdir_app,param);
        fprintf('%d sec\n',round(toc));
    else
        norms=[];
    end
    [Xn_train,Xn_test] = get_Xn_train_test(featdir_app,param,norms);

    disp('In flow') % process the optical-flow images
    if isfield(param,'perpartL2') && param.perpartL2
        fprintf('Compute per part norms ---> '); tic;
        norms=get_partnorms(param.trainsplitpath,featdir_flow,param);
        fprintf('%d sec\n',round(toc));
    else
        norms=[];
    end

    % compute the linear kernels and save them
    if param.compute_kernel
        disp('Compute Kernel Test')
        Ktest = Xn_test'*Xn_train;
        savename=sprintf('%s/Ktest.mat',param.savedir);
        disp(['Save test kernel in: ',savename])
        assert(sum(isinf(Ktest(:)))==0 && sum(isnan(Ktest(:)))==0)
        save(savename,'Ktest','-v7.3')
        clear Ktest ; clear Xn_test ;

        disp('Compute Kernel Train')
        Ktrain = Xn_train'*Xn_train;
        assert(sum(isinf(Ktrain(:)))==0 && sum(isnan(Ktrain(:)))==0)
        savename=sprintf('%s/Ktrain.mat',param.savedir);
        disp(['Save train kernel in: ',savename])
        save(savename,'Ktrain','-v7.3')
        clear Ktrain ; clear Xn_train ;
    end

    % compute the L2 norms
    function norms=get_partnorms(splitpath,featdirraw,param)
    partids = param.partids;

    %% Compute norms
    [samplelist,numfil]=get_sample_list(splitpath,featdirraw);
    % samplelist: a cell array, cell(1000,1)
    % numfil: number of files
    norms = zeros(length(partids),numfil);
    nframes=zeros(length(partids),numfil);
    parfor ii=1:numfil
        pathname=samplelist{ii};
        tmp=load(pathname) ;
        norms_ii=norms(:,ii);
        nframes_ii=nframes(:,ii);
        for nd=1:length(partids)
            cnnf=tmp.features(partids(nd)).x; % CNN features
            norms_ii(nd)=norms_ii(nd)+sum(sqrt(sum(cnnf.^2,2))); % sum of the per-frame L2 norms
            nframes_ii(nd)=nframes_ii(nd)+size(cnnf,1); % size(cnnf,1) is the number of rows of cnnf
        end
        norms(:,ii)=norms_ii;
        nframes(:,ii)=nframes_ii;
        %fprintf('NORM: %d out of %d\n',ii,numfil)
    end
    norms=sum(norms,2);
    nframes=sum(nframes,2);
    norms=norms./nframes;
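The two computations in this excerpt, the average per-frame L2 norm of one part and the linear kernels, can be checked by hand on toy data. This is a sketch under assumptions: the cell array `feats` and the small matrices are invented, and dividing the features by the average norm is a guess at how `get_Xn_train_test` uses `norms`, since that function is not shown.

```matlab
% Per-frame features of one part for two toy training videos (rows are frames).
feats = {[3 4; 6 8], [0 5]};
total_norm = 0; total_frames = 0;
for ii = 1:numel(feats)
    cnnf = feats{ii};
    total_norm   = total_norm + sum(sqrt(sum(cnnf.^2, 2)));  % add the row-wise L2 norms
    total_frames = total_frames + size(cnnf, 1);             % count the frames
end
avg_norm = total_norm / total_frames;        % (5 + 10 + 5) / 3

% Assumed normalization: scale every frame feature by the average norm.
feats_n = cellfun(@(f) f / avg_norm, feats, 'UniformOutput', false);

% Linear kernels with one sample per column, as in Xn_test'*Xn_train.
Xn_train = [1 0; 0 1; 2 1];                  % 3-d features, 2 training samples
Xn_test  = [1; 1; 1];                        % 3-d features, 1 test sample
Ktrain = Xn_train' * Xn_train;               % 2 x 2 Gram matrix
Ktest  = Xn_test'  * Xn_train;               % 1 x 2, test against train
```

Precomputing `Ktrain` and `Ktest` lets an SVM solver with a precomputed-kernel option work on N x N matrices instead of the very high-dimensional descriptors.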