Fixes for a few errors with Win10 + Anaconda3 + tensorflow-gpu


This is really just a record of the pitfalls I hit.

As a long-suffering grad student I am stuck on Windows, simply because the laptop I bought turned out to be an Acer that refuses to take an Ubuntu install.
I had used TensorFlow (https://www.tensorflow.org/) for a while before. The good news is that there is now a TensorFlow build that runs on Windows, installable directly through conda, but that build is CPU-only.

I ran a side-by-side test earlier: the same training program ran roughly 20 times faster on the GPU than on the CPU. Twenty times! So after a brief trial I decided to switch to the GPU build of TensorFlow.

Pioneers have already documented installing tensorflow-gpu with Win10 + Anaconda3; the guide referenced here is detailed, and you can mostly follow it step by step. But I ran into a few problems during installation and in later use, so here are the fixes.

VS2017

Yes, I used VS2017 Community, because VS2015 now counts as an old release and is hard to download from the official site (at any rate, I dug around several times and never found 2015). But when you open CUDA's build folders you will find that it only supports up to VS2015. In the end I settled for second best and compiled the corresponding CUDA builds with VS2013, which went through cleanly.

CUDA and cuDNN

The installers NVIDIA currently offers for Win10 are cuda_8.0.61_win10.exe and cudnn-8.0-windows10-x64-v6.0.

The guide above installs cuDNN as if it were an optional add-on. In reality cuDNN is indispensable: without it you cannot import tensorflow in Python at all.

The official download is cuDNN 6, but in actual use you will find that cuDNN 6 currently trips a bug that still prevents tensorflow from importing cleanly.

If you search for the resulting "module not found" error, you will find a creatively wrong-headed answer on Stack Overflow: after copying over the three cuDNN folders as before, rename cudnn64_6.dll under C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0\bin to cudnn64_5.dll. It has a pile of upvotes, too.

Don't believe it. The rename does let tensorflow import, but real CNN workloads will still fail with version errors. The right move is to go to https://developer.nvidia.com/rdp/cudnn-download and pick "Download cuDNN v5.1 (Jan 20, 2017), for CUDA 8.0".

After downloading it and replacing the previous cudnn64_5.dll under C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0\bin, TensorFlow's built-in CNN functionality finally works in Python.
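Before blaming TensorFlow, it helps to check which cuDNN DLL actually sits in the CUDA bin directory. A small helper like the following can do it (the function name and scan logic are mine, not from any official tool; point it at your own install path):

```python
import os

def list_cudnn_dlls(cuda_bin):
    """Return the cudnn*.dll filenames found in a CUDA bin directory."""
    return sorted(
        name for name in os.listdir(cuda_bin)
        if name.lower().startswith("cudnn") and name.lower().endswith(".dll")
    )

# Example call (default CUDA 8.0 install location on Windows):
# list_cudnn_dlls(r"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0\bin")
```

A TensorFlow build linked against cuDNN 5.1 looks for cudnn64_5.dll specifically, so that is the filename you want to see in the listing.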

The problems did not end there, however.
When using ImageNet with tensorflow-gpu you need to run a test, and you will find it still errors out:

2017-08-06 11:23:17.058978: I c:\tf_jenkins\home\workspace\release-win\m\windows-gpu\py\36\tensorflow\core\common_runtime\gpu\gpu_device.cc:940] Found device 0 with properties:
name: GeForce 940M
major: 5 minor: 0 memoryClockRate (GHz) 1.176
pciBusID 0000:01:00.0
Total memory: 2.00GiB
Free memory: 1.66GiB
2017-08-06 11:23:17.059122: I c:\tf_jenkins\home\workspace\release-win\m\windows-gpu\py\36\tensorflow\core\common_runtime\gpu\gpu_device.cc:961] DMA: 0
2017-08-06 11:23:17.062956: I c:\tf_jenkins\home\workspace\release-win\m\windows-gpu\py\36\tensorflow\core\common_runtime\gpu\gpu_device.cc:971] 0:   Y
2017-08-06 11:23:17.065284: I c:\tf_jenkins\home\workspace\release-win\m\windows-gpu\py\36\tensorflow\core\common_runtime\gpu\gpu_device.cc:1030] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce 940M, pci bus id: 0000:01:00.0)
2017-08-06 11:23:18.495540: W c:\tf_jenkins\home\workspace\release-win\m\windows-gpu\py\36\tensorflow\core\framework\op_def_util.cc:332] Op BatchNormWithGlobalNormalization is deprecated. It will cease to work in GraphDef version 9. Use tf.nn.batch_normalization().
2017-08-06 11:23:21.602161: E c:\tf_jenkins\home\workspace\release-win\m\windows-gpu\py\36\tensorflow\stream_executor\cuda\cuda_dnn.cc:359] could not create cudnn handle: CUDNN_STATUS_NOT_INITIALIZED
2017-08-06 11:23:21.602280: E c:\tf_jenkins\home\workspace\release-win\m\windows-gpu\py\36\tensorflow\stream_executor\cuda\cuda_dnn.cc:366] error retrieving driver version: Unimplemented: kernel reported driver version not implemented on Windows
2017-08-06 11:23:21.605768: E c:\tf_jenkins\home\workspace\release-win\m\windows-gpu\py\36\tensorflow\stream_executor\cuda\cuda_dnn.cc:326] could not destroy cudnn handle: CUDNN_STATUS_BAD_PARAM
2017-08-06 11:23:21.606460: F c:\tf_jenkins\home\workspace\release-win\m\windows-gpu\py\36\tensorflow\core\kernels\conv_ops.cc:671] Check failed: stream->parent()->GetConvolveAlgorithms(&algorithms)

(Ignore my puny graphics card.)

This issue has accumulated a very long thread on GitHub; it appears to be Windows-specific and related to GPU memory capacity. Finally one commenter, like a savior, posted a summarizing comment and a workaround of sorts:

Here is a bit more info on how I temporarily resolved it. I believe these issues are all related to GPU memory allocation and have nothing to do with the errors being reported. There were other errors before this indicating some sort of memory allocation problem but the program continued to progress, eventually giving the cudnn errors that everyone is getting. The reason I believe it works sometimes is that if you use the gpu for other things besides tensorflow such as your primary display, the available memory fluctuates. Sometimes you can allocate what you need and other times it can’t.

From the API
https://www.tensorflow.org/versions/r0.12/how_tos/using_gpu/
“By default, TensorFlow maps nearly all of the GPU memory of all GPUs (subject to CUDA_VISIBLE_DEVICES) visible to the process. This is done to more efficiently use the relatively precious GPU memory resources on the devices by reducing memory fragmentation.”

I think this default allocation is broken in some way that causes this erratic behavior and certain situations to work and others to fail.

I have resolved this issue by changing the default behavior of TF to allocate a minimum amount of memory and grow as needed as detailed in the webpage.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
session = tf.Session(config=config, …)

I have also tried the alternate way and was able to get it to work and fail with experimentally choosing a percentage that worked. In my case it ended up being about .7.

config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.4
session = tf.Session(config=config, …)

Still no word from anyone on the TF team confirming this but it is worth a shot to see if others can confirm similar behavior.

In short: tweak the GPU options when creating the session, and the problem can be sidestepped.
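For readers on the TensorFlow 2.x releases that came later, the same two knobs are spelled differently. A minimal sketch, assuming the `tf.config` API (which postdates this post; the 1024 MB cap below is an arbitrary example value):

```python
import tensorflow as tf

gpus = tf.config.list_physical_devices("GPU")
if gpus:
    # Same idea as allow_growth above: start small, grow GPU memory on demand.
    tf.config.experimental.set_memory_growth(gpus[0], True)
    # Alternative, mirroring per_process_gpu_memory_fraction: cap the
    # allocation outright. Use one approach or the other on a given
    # device, not both:
    # tf.config.set_logical_device_configuration(
    #     gpus[0],
    #     [tf.config.LogicalDeviceConfiguration(memory_limit=1024)])
print("visible GPUs:", len(gpus))
```

On a machine with no visible GPU the loop body simply never runs, so the snippet is safe to keep in shared code.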


Using tensorflow-gpu on Windows is shaping up to be a road full of potholes; more problems are surely ahead. Consider this a placeholder, to be updated.


Update, September 17
The GPU problems with TensorFlow on Windows look unsolvable for now. I have recently been using PyTorch, which fully supports Anaconda3 and whose GPU support has given me no trouble at all. Strongly recommended: PyTorch, the NumPy of deep-learning frameworks!
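You can confirm that claim on your own machine in a few lines; a minimal sketch (nothing here is specific to this post, just the standard PyTorch CUDA check):

```python
import torch

# Sanity-check that PyTorch sees the GPU at all.
print("torch", torch.__version__)
cuda_ok = torch.cuda.is_available()
print("CUDA available:", cuda_ok)
if cuda_ok:
    # Allocate a tensor directly on the GPU and run a trivial op there.
    x = torch.randn(2, 3, device="cuda")
    print((x * 2).device)
```

If `torch.cuda.is_available()` prints False, the CUDA driver or the CUDA-enabled PyTorch build is the thing to fix, before touching any model code.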

For installing PyTorch on Windows, see:

Conda packages for PyTorch on 64-bit Windows
