LARC DL Notes (3): Fine-tuning GoogLeNet on Food-101 vs. Baseline


0x00 Objective

Train GoogLeNet on the Food-101 dataset and compare the result against the published baselines.

0x01 Useful information

(1) The loss is very low, but accuracy stays around 50%. What usually causes this?

If the training loss keeps decreasing while your validation accuracy does not improve, the model is overfitting.

(2) How high is FoodAI's accuracy at the moment, and how much data does it use?

Around 60–70%. Roughly 50 images per class are held out for testing; everything else is used for training.

(3) Food-101 baselines

On Food-101, GoogLeNet v1 should currently reach about 80/95 (top-1/top-5); see the paper "Wide-Slice Residual Networks for Food Recognition".

(4) Caffe folder layout

[screenshot: Caffe directory structure]

0x02 Guide

How to run the experiments using AlexNet/GoogLeNet on Food-101?

  • clone this repo from scratch: git clone https://github.com/deercoder/DeepFood.git
  • configure the environment according to the official tutorial; minor changes have been applied in this repo
  • download the pre-trained models (AlexNet, GoogLeNet) under the ./models folder
  • download the ImageNet mean file under the data/ilsvrc12/ folder with get_ilsvrc_aux.sh
  • run the model from Caffe's root directory with ./models/finetune-food101-alexNet/train_full.sh or ./models/finetune-food101-googlenet/train_full.sh, then check the results

See: https://github.com/deercoder/DeepFood

0x03 DGX docker

The lab server went down, so I switched to the DGX.

Tutorial:
https://philipskokoh.github.io/blog/nvidia-docker-for-your-GPU-application-development

Create a container

nvidia-docker run -it --name <username>-tensorflow -v /mnt/StorageArray2_DGX1/<username>/codes:/opt/codes compute.nvidia.com/nvidia/tensorflow bash

In my case:

nvidia-docker run -it --name huwang-ncaffe -v /mnt/StorageArray2_DGX1/huwang:/home/huwang compute.nvidia.com/nvidia/caffe bash
nvidia-docker run -it --name huwang-bcaffe -v /mnt/StorageArray2_DGX1:/home zh-caffe bash

Start and attach to a container

## run bash shell in tensorflow container
## the container will not be removed after it exits
$ nvidia-docker run -it --name my-tensorflow compute.nvidia.com/nvidia/tensorflow bash

## start and connect back to previously created container my-tensorflow
$ nvidia-docker start my-tensorflow
$ nvidia-docker attach my-tensorflow

## delete the container
$ nvidia-docker rm my-tensorflow

NVIDIA maintains its own Caffe fork, which differs somewhat from BVLC Caffe and occasionally causes problems.
If you want BVLC Caffe, you have to set up the environment yourself.

0x04 Fine-tuning procedure

Step 1

Download the pre-trained model (GoogLeNet); the model definitions and download info live under Caffe's ./models folder.
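For a BVLC checkout, one way to fetch the weights is the bundled helper script (a minimal sketch, assuming the standard repository layout):

# Run from Caffe's root; downloads bvlc_googlenet.caffemodel into models/bvlc_googlenet/
python scripts/download_model_binary.py models/bvlc_googlenet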

Step 2

Download the ImageNet mean file using get_ilsvrc_aux.sh under data/ilsvrc12/.
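A minimal invocation, assuming a standard Caffe checkout:

# From Caffe's root directory; fetches imagenet_mean.binaryproto and related files
./data/ilsvrc12/get_ilsvrc_aux.sh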

Another school of thought says you should use the mean of the new data. My take: if the new dataset is small, keep the old (ImageNet) mean; but if the new dataset is large, the true mean shifts toward the new data anyway, so you should compute and use the new mean.

Step 3

Convert the image files to LMDB.
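A minimal sketch using Caffe's convert_imageset tool; the dataset root and list-file paths (train.txt containing one "<image path> <label>" per line) are assumptions for illustration:

# Resize to 256x256, shuffle, and write the result to train_lmdb
TOOL=./build/tools/convert_imageset
DATA_ROOT=/path/to/food-101/images/
LIST=/path/to/food-101/train.txt

$TOOL --resize_height=256 --resize_width=256 --shuffle \
    $DATA_ROOT $LIST train_lmdb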

Step 4

Prepare train_val.prototxt (updating the mean file and data source), solver.prototxt, and train.sh, then start training.
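For reference, a sketch of what the TRAIN data layer in train_val.prototxt ends up looking like after these changes; the source and mean_file paths are assumptions:

layer {
  name: "data"
  type: "Data"
  top: "data"
  top: "label"
  include { phase: TRAIN }
  transform_param {
    mirror: true
    crop_size: 224
    mean_file: "food101_mean.binaryproto"   # or the ImageNet mean, see Step 2
  }
  data_param {
    source: "train_lmdb"
    batch_size: 64
    backend: LMDB
  }
}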

Manually compute the Food-101 mean:

#!/usr/bin/env sh
# Compute the mean image from the Food-101 training lmdb
# By Bill
DBNAME=.
ListPath=.
TOOL=/usr/bin/caffe_compute_image_mean

$TOOL $DBNAME/train_lmdb \
  $ListPath/food101_mean.binaryproto

echo "Done."

0x05 Hyperparameter tuning

Trained overnight (10 pm to 1 pm):

top-1: 0.3, top-5: 0.5

Suggestions received (a solver sketch reflecting them follows the list):
(1) My batch size is currently 32 on a single GPU; 64 or 128 would be faster.
(2) When fine-tuning the whole network, the initial lr can be larger; anywhere in 0.005–0.01 works (I am currently using 0.001).
(3) If lr is set large, gamma does not need to be this large; the smaller gamma is, the faster lr decays.
(4) Set max_iter and stepsize according to your own data by working out the epochs. It is fine to start generous, e.g. max_iter of 150k or 200k.
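A minimal solver.prototxt sketch reflecting the suggestions above; the net path and exact values are assumptions for illustration:

net: "train_val.prototxt"
test_iter: 100                 # test batches per evaluation
test_interval: 1000
base_lr: 0.005                 # suggestion (2)
lr_policy: "step"
gamma: 0.2                     # suggestion (3): smaller gamma = faster decay
stepsize: 30000
max_iter: 150000               # suggestion (4)
momentum: 0.9
weight_decay: 0.0002
snapshot: 10000
snapshot_prefix: "snapshots/googlenet_food101"
solver_mode: GPU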

0x06 Problems encountered

1. Scripts written on Windows and copied to Linux with Xftp keep failing with "not found" errors

It had been suggested that this was an end-of-file issue, but trying that did not help.

For scripts written on Windows (including those without a .sh suffix) that need to run on Linux: open the file in vim, run :set ff=unix, and save. This converts the CRLF line endings into the LF format Linux recognizes.
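Equivalent one-liners, assuming dos2unix or sed is available:

# Strip Windows carriage returns from a script
dos2unix train.sh
# or, without dos2unix:
sed -i 's/\r$//' train.sh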

See: http://blog.csdn.net/faryang/article/details/52348029

2. Syntax error: redirection unexpected

Running this script kept throwing the error:

##!/usr/bin/env sh
#!/bin/bash
# By Bill
#GPU_ID=3
NET=finetune_googlenet_food101
SOLVER=finetune_googlenet_foodai101_solver.prototxt
TOOLS=caffe
WEIGHTS=./bvlc_googlenet.caffemodel
#set -x
#set -e
#LOG=logs/${NET}.txt.`date +'%Y-%m-%d_%H-%M-%S'`
LOG=logs/${NET}.log
exec &> >(tee -a "$LOG")
echo Logging output to "$LOG"

#./build/tools/caffe train --solver=./examples/food_tst/${SOLVER} --weights=./examples/food_tst/bvlc_googlenet.caffemodel --gpu=5
$TOOLS train \
    --solver=${SOLVER} \
    --gpu=4,5
    #--snapshot=examples/food_tst/googlenet_food100_aug_iter_5000.solverstate
    --weights=$WEIGHTS \
    #| tee $LOG


The cause: exec &> >(tee -a ...) uses bash-only process substitution, so the script must be run with bash; I had been running it with sh.

See: https://stackoverflow.com/questions/2462317/bash-syntax-error-redirection-unexpected

3. Data layer prefetch queue empty

This warning appears when feeding images directly (an ImageData layer), because the data I/O is too slow to keep the prefetch queue full; I switched to LMDB instead.

https://github.com/BVLC/caffe/issues/3177

4. When running Caffe, is it better to go through Python instead of the command line? What are the benefits? Plotting directly?

Plotting from the log that the Caffe command line prints is good enough.
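One possible workflow, assuming a standard BVLC checkout with the bundled log-parsing helpers under tools/extra/ (the log filename is an assumption):

# Split the training log into <log>.train and <log>.test tables
python tools/extra/parse_log.py logs/finetune_googlenet_food101.log ./parsed/
# Chart type 0 plots test accuracy vs. iterations
python tools/extra/plot_training_log.py.example 0 acc_vs_iters.png logs/finetune_googlenet_food101.log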

5. The caffemodel was not found, which means the run actually trained from scratch. Is that where the 66/86 result came from?


wh_train: line 24: --weights=./bvlc_googlenet.caffemodel: No such file or directory

The shell script was as follows:

##!/usr/bin/env sh
#!/bin/bash
# By Bill
#GPU_ID=3
NET=finetune_googlenet_food101
SOLVER=/home/huwang/bcaffe/dataset/food-101/finetune_googlenet_foodai101_solver.prototxt
TOOLS=caffe
WEIGHTS=/home/huwang/bcaffe/dataset/food-101/bvlc_googlenet.caffemodel
#set -x
#set -e
LOG=logs/${NET}_`date +'%Y-%m-%d_%H-%M-%S'`.log
#LOG=logs/${NET}.log
exec &> >(tee -a "$LOG")
echo Logging output to "$LOG"

#./build/tools/caffe train --solver=./examples/food_tst/${SOLVER} #--weights=./examples/food_tst/bvlc_googlenet.caffemodel --gpu=5
$TOOLS train \
    --solver=${SOLVER} \
    --gpu=4,5
    #--snapshot=examples/food_tst/googlenet_food100_aug_iter_5000.solverstate
    --weights=$WEIGHTS \
    #| tee $LOG

The cause: the --gpu=4,5 line is missing a trailing \ continuation, and the line after it is a comment, so the continuation breaks and --weights=... is executed as a standalone command.
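The fixed invocation, with every continued line ending in \ and the commented-out --snapshot line moved out of the continuation:

$TOOLS train \
    --solver=${SOLVER} \
    --gpu=4,5 \
    --weights=$WEIGHTS
    #--snapshot=examples/food_tst/googlenet_food100_aug_iter_5000.solverstate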


6. Batch size vs. number of GPUs

Got the error Check failed: batch * solver_count == total Batch size must be divisible by the number of solvers (GPUs). The cause was running on 3 GPUs; switching to 2 fixed it (see batchsize_divide_gpu.log).

NVIDIA's Caffe handles this slightly differently from plain BVLC Caffe: each step processes batch * solver_count (GPUs) images, and the total batch size must be divisible by the number of GPUs.

My take: in BVLC Caffe the prototxt batch_size is the per-GPU batch size, whereas in NVIDIA's Caffe it is the total, split evenly across the GPUs.
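A quick worked example (values assumed): with batch_size: 64 and --gpu=4,5,6 there are three solvers, and 64 % 3 != 0, so the check fails; with --gpu=4,5 there are two solvers, 64 % 2 == 0, and each GPU processes 32 images per step.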

See: https://github.com/NVIDIA/DIGITS/issues/413

7. Does re-running training overwrite existing snapshot files?

The answer I got was no, that it would continue on top of the existing snapshots; in practice, however, I found that the files do get overwritten.
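One way to keep runs from clobbering each other (an assumed naming scheme, not from the original setup): give each run its own snapshot_prefix in solver.prototxt.

# e.g. in solver.prototxt, changed per run:
snapshot_prefix: "snapshots/run_20170819/googlenet_food101"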

0x07 Experimental results

Corresponding code:
https://github.com/billhhh/caffe-LARC/tree/master/6%5Bfine%20tune%5Dfood-101

| No. | Description | Top-1 | Top-5 |
| --- | --- | --- | --- |
| 1 | mean:imgnet batch_size:32/32 lr:0.001 gamma:0.96 max_iter:100000, scratch, server, gpu:1, 20170816_9pm (forgot the test mean) | 0.268406 | 0.544563 |
| 2 | mean:input batch_size:64/64 lr:0.005 gamma:0.2 max_iter:100000, scratch, dgx ncaffe, gpu:2, 20170817_6pm (reading images directly was too slow; shelved) | NA | NA |
| 3 | mean:imgnet batch_size:64/64 lr:0.005 gamma:0.2 max_iter:100000, scratch, server, gpu:3, 20170817_9pm (recompiled Caffe) | 0.688359 | 0.894203 |
| 4 | mean:input batch_size:64/64 lr:0.005 gamma:0.2 max_iter:100000, scratch, dgx, gpu:2, 20170817_9pm (relative-path experiment OK) | 0.668656 | 0.886813 |
| 5 | mean:input batch_size:64/64 lr:0.005 gamma:0.2 max_iter:100000, scratch, dgx, gpu:2, 20170818_8am (same as 4, with absolute paths) | 0.662859 | 0.881969 |
| 6 | mean:imgnet batch_size:64/64 lr:0.001 gamma:0.96 max_iter:100000, scratch, dgx, gpu:2, 20170818_11am | 0.641859 | 0.868469 |
| 7 | mean:imgnet batch_size:64/64 lr:0.005 gamma:0.2 max_iter:100000, pre-trained, dgx, gpu:2, 20170818_8.21pm | 0.801063 | 0.945859 |
| 8 | mean:imgnet batch_size:64/64 lr:0.008 gamma:0.2 max_iter:300000, pre-trained, dgx, gpu:2, 20170818_12pm | 0.798828 | 0.945312 |
| 9 | mean:food-101 batch_size:64/64 lr:0.005 gamma:0.2 max_iter:100000, pre-trained, dgx, gpu:2, 20170819_10.45am | 0.802125 | 0.946219 |
| 10 | mean:food-101 batch_size:64/64 lr:0.005 gamma:0.1 max_iter:100000, pre-trained, dgx, gpu:2, 20170819_4pm | 0.802047 | 0.948672 |
| 11 | mean:food-101 batch_size:128/128 lr:0.005 gamma:0.2 max_iter:100000, pre-trained, dgx, gpu:2, 20170819_8pm | 0.802664 | 0.947563 |
| 12 | pre-trained GoogLeNet (test only, no fine-tuning), dgx, gpu:2, 20170820_10am | | |
| 14 | mean:food-101 batch_size:64/100 lr:0.005 gamma:0.2 max_iter:100000, pre-trained, dgx, gpu:2, 20170820_11am | 0.801186 | 0.94668 |

Baselines:

(1)from “Deep Learning Based Food Recognition”

[baseline result tables from the paper omitted]

(2)from “Wide-Slice Residual Networks for Food Recognition”

[baseline result table from the paper omitted]

Appendix:

DGX: a few things to note

Important 1: DO NOT run apt-get upgrade or update or install any packages on the DGX1 host. You should install packages on your own containers only. If you need to install anything on the host system, please kindly contact System Administrator for assistance.

Important 2: You MUST name your containers with your username; failure to comply will result in the removal of containers without prior notice.

Use docker containers only to run your code
You have to use a docker container to run your GPU code. NVIDIA provides nvidia-docker, a wrapper around docker-cli, to run GPU applications. You have to use nvidia-docker to run applications that require the GPU; otherwise your application will not leverage the NVIDIA GPUs.

Use docker images as backup
Please back up your docker containers as docker images and store them in the Storage Array. In the event of docker container corruption, the backup image can be used to recover and restore the containers.

GPU Codes (Recommended)
To maximize the computing capabilities of the DGX-1, and to ensure there are sufficient CPU resources for everyone to perform their experiments, you should execute your code in GPU mode instead of CPU mode.
For the TensorFlow and CUDA frameworks, you can refer to the guides below for assigning single or multiple GPUs in your code, as well as for preventing over-utilisation of the resources.

Tensorflow - https://www.tensorflow.org/how_tos/using_gpu/
cuda flag, CUDA_VISIBLE_DEVICES - http://acceleware.com/blog/cudavisibledevices-masking-gpus

Storage Array
The DGX-1 hard disk has no redundancy and is optimized only for speed. We suggest you put your important files in the storage array (/mnt/StorageArray2_DGX1/); it has redundancy and more space. You can copy your data temporarily to the DGX-1 hard disk to speed up your experiments, but always keep a backup in the storage array. Use the -v option of nvidia-docker run to mount a host folder into your container, for example:

nvidia-docker run -it --name <username>-tensorflow -v /mnt/StorageArray2_DGX1/<username>/codes:/opt/codes compute.nvidia.com/nvidia/tensorflow bash

That command maps /mnt/StorageArray2_DGX1/<username>/codes to /opt/codes in your container.
compute.nvidia.com/nvidia/tensorflow is the official TensorFlow container from NVIDIA.

You can read my blog post about basic nvidia-docker usage and commonly used nvidia-docker commands, such as docker ps -s to check containers and nvidia-smi to check GPU usage:
https://philipskokoh.github.io/blog/nvidia-docker-for-your-GPU-application-development

Please also read the introduction to nvidia-docker from NVIDIA:
https://devblogs.nvidia.com/parallelforall/nvidia-docker-gpu-server-application-deployment-made-easy/

Food100

Hi, I have put the SGFood100 data in the /data5/FOODAI/Food100 folder. There are two folders: Images, which holds the source images, and ImageSets, which holds my original image sets and label mapping, as well as the corresponding LMDB data (you had better regenerate it yourself). The absolute paths are a bit different now, but only a small modification is required.

I also put the results based on VGG16, AlexNet, Res152 and InceptionV1 in results.md; you can refer to these results, which I obtained with the provided image sets.
