Multithreaded download of vgg_face_data with Python


Introduction

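While looking for public face datasets recently, I found that vgg-face-data is considerably larger than webface. After downloading it, though, I discovered that the official site only provides each image's id and url plus some other metadata, so I wrote a Python script to download the images. Please get the image URL lists from the official site yourself; here is the link: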
vgg_face_dataset.tar.gz

Main idea

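The links in vgg_face_dataset are organized as one txt file per person. Each file contains about 1,000 lines, and each line holds the image id, the image url, and some other information. Downloading is just a matter of iterating over these files, and since the files share no data with each other, the job is well suited to multithreading. Create an images folder in the same directory as the download script; all downloaded images are saved under it, with one subfolder per identity.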
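To make the file layout concrete, here is a minimal sketch of turning one identity's lines into (url, save path) pairs. It assumes each line begins with the image id and the image url separated by whitespace and ignores any remaining fields. The function name parse_identity_file, the identity "Some_Person", and the sample lines are made up for illustration and are not taken from the dataset.

```python
import os


def parse_identity_file(lines, identity, root="images"):
    """Yield (url, save_path) pairs for one identity's list of lines.

    Assumes each line starts with "<image_id> <image_url>" separated by
    whitespace; any further fields on the line are ignored.
    """
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        image_id, image_url = fields[0], fields[1]
        yield image_url, os.path.join(root, identity, image_id + ".jpg")


# Made-up sample lines, for illustration only.
sample = [
    "00000001 http://example.com/a.jpg 10 20 110 120\n",
    "\n",
    "00000002 http://example.com/b.jpg 5 5 95 95\n",
]
jobs = list(parse_identity_file(sample, "Some_Person"))
for url, path in jobs:
    print(url, "->", path)
```

Because each identity's pairs depend only on its own file, the resulting jobs can be handed to worker threads without any locking.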

download.py

The code is fairly simple and should be self-explanatory:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Sat. April 8 09:19:38 2017
@author: wujiyang
"""
import os
import sys
import threading
import time
import urllib.request

MAX_THREADS = 1000  # upper bound on the number of live threads


def dir_list(path):
    """Recursively walk a directory and return the paths of all files in it."""
    allfile = []
    for filename in os.listdir(path):
        filepath = os.path.join(path, filename)
        if os.path.isdir(filepath):
            # collect the files found in the subdirectory as well
            allfile.extend(dir_list(filepath))
        else:
            allfile.append(filepath)
    return allfile


def download_and_save(url, savename):
    """Fetch the image at a remote url and save it to savename."""
    try:
        with urllib.request.urlopen(url) as fp:
            data = fp.read()
        with open(savename, 'wb') as fid:
            fid.write(data)
        print("download succeed: " + url)
    except (OSError, ValueError):
        # URLError and HTTPError are subclasses of OSError;
        # ValueError covers malformed urls
        print("download failed: " + url)


def get_all_image(filename):
    """Start one download thread per line of a single identity's txt file."""
    # identity name = file name without its directory and '.txt' extension
    name = os.path.splitext(os.path.basename(filename))[0]
    savedir = os.path.join('images', name)
    os.makedirs(savedir, exist_ok=True)
    with open(filename) as fid:
        lines = fid.readlines()
    for line in lines:
        line_split = line.split()
        if len(line_split) < 2:
            continue  # skip blank or malformed lines
        image_id = line_split[0]
        image_url = line_split[1]
        savefile = os.path.join(savedir, image_id + '.jpg')
        # wait until fewer than MAX_THREADS threads are alive
        while threading.active_count() >= MAX_THREADS:
            time.sleep(0.01)
        t = threading.Thread(target=download_and_save, args=(image_url, savefile))
        t.start()


# usage: python download.py .\vgg_face_dataset\files
if __name__ == "__main__":
    if len(sys.argv) != 2:
        print('Usage: python %s <directory of url txt files>' % sys.argv[0])
        sys.exit(1)
    file_dir = sys.argv[1]
    for path in dir_list(file_dir):
        get_all_image(path)
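The thread-count check in download.py can also be expressed with a bounded worker pool. The following is only a sketch of that alternative, not the script's own method: it swaps the raw threading.Thread objects for ThreadPoolExecutor from the standard library. The names download_all, fetch, and fake_fetch, the 30-second timeout, and the worker counts are assumptions made for illustration. The demo uses a fake fetcher, so it runs without network access.

```python
import os
import tempfile
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def fetch(url, timeout=30):
    """Return the raw bytes at url (real network access; unused in the demo)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def download_all(jobs, fetcher=fetch, max_workers=32):
    """Run fetcher over (url, save_path) jobs in a bounded thread pool.

    Returns a dict mapping each url to True (saved) or False (failed).
    """
    def one(job):
        url, path = job
        try:
            data = fetcher(url)
            with open(path, "wb") as f:
                f.write(data)
            return url, True
        except (OSError, ValueError):
            return url, False

    # The pool never runs more than max_workers downloads at once.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(one, jobs))


def fake_fetch(url):
    """Stand-in for fetch: fails for urls containing 'bad'."""
    if "bad" in url:
        raise OSError("simulated failure")
    return b"fake image bytes"


tmp = tempfile.mkdtemp()
demo_jobs = [
    ("http://example.com/ok.jpg", os.path.join(tmp, "1.jpg")),
    ("http://example.com/bad.jpg", os.path.join(tmp, "2.jpg")),
]
results = download_all(demo_jobs, fetcher=fake_fetch, max_workers=4)
print(results)
```

A pool of a few dozen workers is usually enough for downloads that spend most of their time waiting on the network, and it avoids polling the number of live threads in a loop.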

Summary

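After downloading, I found that although there is a lot of data, it is fairly "dirty". The first 200 to 300 images in each class seem fine, but the later ones often no longer look like the person, and there is plenty of other bad data as well, such as the occasional image of someone of the opposite sex. A fair amount of preprocessing is still needed before real use.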
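One crude first pass, based only on the impression above that the leading images of each class look more reliable, is to keep just the files with the smallest ids in every identity folder. The sketch below assumes the files are named by a numeric image id followed by .jpg, as the download script saves them, and that a lower id means an earlier image. The function name keep_leading and the cutoff of 250 are hypothetical, and this is not a substitute for real cleaning.

```python
import os


def keep_leading(filenames, limit=250):
    """Return the filenames whose numeric image id is <= limit, sorted by id.

    Assumes files are named "<numeric id>.jpg"; any other name is ignored.
    """
    kept = []
    for fn in filenames:
        stem, ext = os.path.splitext(fn)
        if ext.lower() == ".jpg" and stem.isdigit() and int(stem) <= limit:
            kept.append((int(stem), fn))
    return [fn for _, fn in sorted(kept)]


# Made-up folder listing, for illustration only.
names = ["00000300.jpg", "00000002.jpg", "notes.txt", "00000001.jpg"]
selected = keep_leading(names, limit=250)
print(selected)
```

In practice the list of names would come from os.listdir on one identity's subfolder under images.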

Still, thanks to the VGG Group for their work and generosity; collecting this many images is no small feat!

1 0