A cuDNN version mismatch made the TensorFlow GPU source build fail


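I finally got GPU TensorFlow built from source and installed. Here are the problems I ran into along the way.

Hardware: Ryzen 1700X + 16GB RAM + GeForce GTX 950 (2GB)

           Some sources say you need at least a 6GB GTX 1060. Graphics card prices have been really high for the past six months, so the 950 will have to do for now.

Software: Ubuntu 16.04 + CUDA 9.0 + cudnn-9.0-linux-x64-v7 + tensorflow-1.4.0rc1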


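Most of the guides I found online use CUDA 8.0 + cuDNN v5. I set out to follow them, but the download currently offered on the NVIDIA site is already CUDA 9.0, not CUDA 8.0 (the site has been updated). Preferring new over old, I decided to go with CUDA 9.0.

Downloading cuDNN requires registering first, and I hit two problems: 1) for a while the NVIDIA registration page would not open at all (my network was fine); 2) later the page opened, but the country drop-down was blank, so I could not select this required field and registration could not complete. As a result I could not download cuDNN from the official site. Someone online had shared a v5 copy on a cloud drive (cudnn-8.0-linux-x64-v5.1, which by its file name is a cuDNN 5.1 build for CUDA 8.0), so the combination I ended up with was: CUDA 9.0 + cudnn-8.0-linux-x64-v5.1

After installing, I ran TensorFlow's ./configure and entered CUDA 9.0 and cuDNN 5.1.0; configuration reported no errors. But building with the following command failed, with the errors shown below.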

bazel build --copt=-march=native -c opt --config=cuda --verbose_failures //tensorflow/tools/pip_package:build_pip_package

Error message 1:

ERROR: /home/kou/tensorflow/tensorflow/stream_executor/BUILD:52:1: C++ compilation of rule '//tensorflow/stream_executor:cuda_platform' failed (Exit 1): crosstool_wrapper_driver_is_not_gcc failed: error executing command
(cd /home/kou/.cache/bazel/_bazel_kou/3f3a4712723b62ae321569eb62995c39/execroot/org_tensorflow && \
exec env - \
LD_LIBRARY_PATH=/usr/local/cuda-9.0/lib64:/usr/local/cuda-9.0/extras/CUPTI/lib64:/usr/local/cuda-9.0/lib64:/usr/local/cuda-9.0/extras/CUPTI/lib64: \
PATH=/home/kou/anaconda3/bin:/usr/local/cuda-9.0/bin:/home/kou/bin:/home/kou/.local/bin:/home/kou/anaconda3/bin:/usr/local/cuda-9.0/bin:/home/kou/bin:/home/kou/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/snap/bin:/home/kou/bin \
PWD=/proc/self/cwd \
external/local_config_cuda/crosstool/clang/bin/crosstool_wrapper_driver_is_not_gcc -U_FORTIFY_SOURCE '-D_FORTIFY_SOURCE=1' -fstack-protector -fPIE -Wall -Wunused-but-set-parameter -Wno-free-nonheap-object -fno-omit-frame-pointer -g0 -O2 -DNDEBUG -ffunction-sections -fdata-sections -g0 '-std=c++11' -g0 -MD -MF bazel-out/host/bin/tensorflow/stream_executor/_objs/cuda_platform/tensorflow/stream_executor/cuda/cuda_dnn.pic.d '-frandom-seed=bazel-


Error message 2:

Target //tensorflow/tools/pip_package:build_pip_package failed to build
ERROR: /home/kou/tensorflow/tensorflow/contrib/boosted_trees/BUILD:423:1 C++ compilation of rule '//tensorflow/stream_executor:cuda_platform' failed (Exit 1): crosstool_wrapper_driver_is_not_gcc failed: error executing command
(cd /home/kou/.cache/bazel/_bazel_kou/3f3a4712723b62ae321569eb62995c39/execroot/org_tensorflow && \
exec env - \
LD_LIBRARY_PATH=/usr/local/cuda-9.0/lib64:/usr/local/cuda-9.0/extras/CUPTI/lib64:/usr/local/cuda-9.0/lib64:/usr/local/cuda-9.0/extras/CUPTI/lib64: \
PATH=/home/kou/anaconda3/bin:/usr/local/cuda-9.0/bin:/home/kou/bin:/home/kou/.local/bin:/home/kou/anaconda3/bin:/usr/local/cuda-9.0/bin:/home/kou/bin:/home/kou/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/snap/bin:/home/kou/bin \
PWD=/proc/self/cwd \
external/local_config_cuda/crosstool/clang/bin/crosstool_wrapper_driver_is_not_gcc -U_FORTIFY_SOURCE '-D_FORTIFY_SOURCE=1' -fstack-protector -fPIE -Wall -Wunused-but-set-parameter -Wno-free-nonheap-object -fno-omit-frame-pointer -g0 -O2 -DNDEBUG -ffunction-sections -fdata-sections -g0 '-std=c++11' -g0 -MD -MF bazel-


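I went through a lot of material and still could not figure it out. A few sources mention problems under CUDA 8.0 caused by a cuDNN version that was too old, but the error messages there were different. I began to suspect that CUDA 9.0 and cuDNN v5 simply do not go together.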
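One way to confirm which cuDNN is actually installed, rather than relying on the archive name, is to read the version macros from the cuDNN header. The sketch below is an illustration and not part of the original steps. It assumes the header defines CUDNN_MAJOR, CUDNN_MINOR and CUDNN_PATCHLEVEL, as the cudnn.h files shipped with cuDNN 5.x through 7.x do. The function name parse_cudnn_version and the sample header text are made up for this example.

```python
import re

def parse_cudnn_version(header_text):
    """Return (major, minor, patch) read from the text of cudnn.h, or None if a macro is missing."""
    parts = []
    for name in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL"):
        # Match lines such as "#define CUDNN_MAJOR 7".
        match = re.search(r"^\s*#\s*define\s+" + name + r"\s+(\d+)", header_text, re.MULTILINE)
        if match is None:
            return None
        parts.append(int(match.group(1)))
    return tuple(parts)

# Hypothetical header excerpt, used only to demonstrate the parser.
SAMPLE_HEADER = """
#define CUDNN_MAJOR 7
#define CUDNN_MINOR 0
#define CUDNN_PATCHLEVEL 3
"""

print(parse_cudnn_version(SAMPLE_HEADER))  # (7, 0, 3)
```

To check a real install, pass in the contents of the header itself, for example open("/usr/local/cuda-9.0/include/cudnn.h").read(). That path is an assumption based on the /usr/local/cuda-9.0 prefix that appears in the build log; adjust it to wherever cudnn.h was copied.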

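I tried registering on the official site again and this time it worked. I downloaded the latest cudnn-9.0-linux-x64-v7, replaced v5 with it, and reran the build command. Where it had previously failed almost immediately, it now printed warning and info messages, build output scrolled up the screen, and CPU and RAM usage climbed again and again. It looked promising. After a long wait I finally saw the long-awaited build-success message: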

At global scope:
cc1plus: warning: unrecognized command line option '-Wno-self-assign'
Target //tensorflow/tools/pip_package:build_pip_package up-to-date:
  bazel-bin/tensorflow/tools/pip_package/build_pip_package
INFO: Elapsed time: 2399.049s, Critical Path: 890.03s
INFO: Build completed successfully, 3418 total actions

The build finally succeeded.

Build the pip package

bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg

2017年 11月 04日 星期六 21:35:10 CST : === Using tmpdir: /tmp/tmp.DGLrNxOuWV
~/tensorflow/bazel-bin/tensorflow/tools/pip_package/build_pip_package.runfiles ~/tensorflow
~/tensorflow
/tmp/tmp.DGLrNxOuWV ~/tensorflow
2017年 11月 04日 星期六 21:35:12 CST : === Building wheel
warning: no files found matching '*.dll' under directory '*'
warning: no files found matching '*.lib' under directory '*'
warning: no files found matching '*.h' under directory 'tensorflow/include/tensorflow'
warning: no files found matching '*' under directory 'tensorflow/include/Eigen'
warning: no files found matching '*' under directory 'tensorflow/include/external'
warning: no files found matching '*.h' under directory 'tensorflow/include/google'
warning: no files found matching '*' under directory 'tensorflow/include/third_party'
warning: no files found matching '*' under directory 'tensorflow/include/unsupported'
~/tensorflow
2017年 11月 04日 星期六 21:35:35 CST : === Output wheel file is in: /tmp/tensorflow_pkg

The install file produced in /tmp/tensorflow_pkg is:

-rw-rw-r--  1 XX XX  76222893 11月  4 21:35 tensorflow-1.4.0rc1-cp36-cp36m-linux_x86_64.whl

Install:

pip install /tmp/tensorflow_pkg/tensorflow-1.4.0rc1-cp36-cp36m-linux_x86_64.whl 
Processing /tmp/tensorflow_pkg/tensorflow-1.4.0rc1-cp36-cp36m-linux_x86_64.whl
Requirement already satisfied: six>=1.10.0 in /home/kou/anaconda3/lib/python3.6/site-packages (from tensorflow==1.4.0rc1)
Requirement already satisfied: wheel>=0.26 in /home/kou/anaconda3/lib/python3.6/site-packages (from tensorflow==1.4.0rc1)
Collecting enum34>=1.1.6 (from tensorflow==1.4.0rc1)
  Downloading enum34-1.1.6-py3-none-any.whl
Collecting tensorflow-tensorboard<0.5.0,>=0.4.0rc1 (from tensorflow==1.4.0rc1)
  Downloading tensorflow_tensorboard-0.4.0rc2-py3-none-any.whl (1.7MB)
    100% |████████████████████████████████| 1.7MB 107kB/s 
Collecting protobuf>=3.4.0 (from tensorflow==1.4.0rc1)
  Downloading protobuf-3.4.0-cp36-cp36m-manylinux1_x86_64.whl (6.2MB)
    100% |████████████████████████████████| 6.2MB 64kB/s 
Requirement already satisfied: numpy>=1.12.1 in /home/kou/anaconda3/lib/python3.6/site-packages (from tensorflow==1.4.0rc1)
Requirement already satisfied: werkzeug>=0.11.10 in /home/kou/anaconda3/lib/python3.6/site-packages (from tensorflow-tensorboard<0.5.0,>=0.4.0rc1->tensorflow==1.4.0rc1)
Collecting html5lib==0.9999999 (from tensorflow-tensorboard<0.5.0,>=0.4.0rc1->tensorflow==1.4.0rc1)
  Downloading html5lib-0.9999999.tar.gz (889kB)
    100% |████████████████████████████████| 890kB 123kB/s 
Collecting markdown>=2.6.8 (from tensorflow-tensorboard<0.5.0,>=0.4.0rc1->tensorflow==1.4.0rc1)
  Downloading Markdown-2.6.9.tar.gz (271kB)
    100% |████████████████████████████████| 276kB 112kB/s 
Collecting bleach==1.5.0 (from tensorflow-tensorboard<0.5.0,>=0.4.0rc1->tensorflow==1.4.0rc1)
  Downloading bleach-1.5.0-py2.py3-none-any.whl
Requirement already satisfied: setuptools in /home/kou/anaconda3/lib/python3.6/site-packages (from protobuf>=3.4.0->tensorflow==1.4.0rc1)
Building wheels for collected packages: html5lib, markdown
  Running setup.py bdist_wheel for html5lib ... done
  Stored in directory: /home/kou/.cache/pip/wheels/6f/85/6c/56b8e1292c6214c4eb73b9dda50f53e8e977bf65989373c962
  Running setup.py bdist_wheel for markdown ... done
  Stored in directory: /home/kou/.cache/pip/wheels/bf/46/10/c93e17ae86ae3b3a919c7b39dad3b5ccf09aeb066419e5c1e5
Successfully built html5lib markdown
Installing collected packages: enum34, protobuf, html5lib, markdown, bleach, tensorflow-tensorboard, tensorflow
  Found existing installation: html5lib 0.999999999
    Uninstalling html5lib-0.999999999:
      Successfully uninstalled html5lib-0.999999999
  Found existing installation: bleach 2.0.0
    Uninstalling bleach-2.0.0:
      Successfully uninstalled bleach-2.0.0
Successfully installed bleach-1.5.0 enum34-1.1.6 html5lib-0.9999999 markdown-2.6.9 protobuf-3.4.0 tensorflow-1.4.0rc1 tensorflow-tensorboard-0.4.0rc2


Test:

kou@aikou:/tmp/tensorflow_pkg$ python
Python 3.6.3 |Anaconda, Inc.| (default, Oct 13 2017, 12:02:49) 
[GCC 7.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
>>> hello = tf.constant('Hello, Tensorflow!')
>>> sess = tf.Session()
2017-11-04 21:50:03.062971: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:892] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2017-11-04 21:50:03.063348: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1031] Found device 0 with properties: 
name: GeForce GTX 950 major: 5 minor: 2 memoryClockRate(GHz): 1.355
pciBusID: 0000:26:00.0
totalMemory: 1.95GiB freeMemory: 1.36GiB
2017-11-04 21:50:03.063371: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce GTX 950, pci bus id: 0000:26:00.0, compute capability: 5.2)
>>> print(sess.run(hello))
b'Hello, Tensorflow!'


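Summary:

1. The strange errors while building the TensorFlow source were actually caused by CUDA and cuDNN versions that do not match. I have verified that the following combination works:

        cuda_9.0.176_384.81_linux  +  cudnn-9.0-linux-x64-v7

2. When building the TensorFlow source, you can add the --verbose_failures flag to the command (as below). When a step fails, bazel then prints the full command line behind the ERROR, which helps with further analysis.

      For the errors above, though, the output shown did not seem to be enough to locate the cause.

       bazel build --copt=-march=native -c opt --config=cuda --verbose_failures //tensorflow/tools/pip_package:build_pip_package

3. The Ryzen 1700X's 8 cores and 16 threads made a big difference during compilation, and build time was noticeably shorter. Comparison data:

     AMD Ryzen 1700X + 16GB RAM, TensorFlow GPU build from source: 2399.049s

     Intel i5 6500 + 16GB RAM, TensorFlow CPU build from source: 5934.066s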
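To put the two build times side by side, the short calculation below works out their ratio. It uses only the two figures reported above. Keep in mind that one is a GPU build and the other a CPU build, so the ratio is a rough indication rather than a like-for-like benchmark.

```python
# Build times copied from the comparison above, in seconds.
ryzen_gpu_build_s = 2399.049  # Ryzen 1700X, GPU build
i5_cpu_build_s = 5934.066     # i5 6500, CPU build

ratio = i5_cpu_build_s / ryzen_gpu_build_s

print(f"The i5 build took {ratio:.2f}x as long as the Ryzen build")
print(f"Ryzen: {ryzen_gpu_build_s / 60:.1f} min, i5: {i5_cpu_build_s / 60:.1f} min")
# The i5 build took 2.47x as long as the Ryzen build
# Ryzen: 40.0 min, i5: 98.9 min
```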


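Postscript (2017-12-17)

Over the last few days my Ubuntu system broke and I reinstalled it. I kept the same CUDA and cuDNN versions as above and downloaded the latest TensorFlow (as of December 16), and the build failed. Going back to the TensorFlow source used in the walkthrough above (the early-November version), the build went through without trouble. Some posts say the newer version needs a patch, but I did not want to fiddle with the install any further, so I am using this version for now. Two things seem worth noting here:

1. As far as possible, use the latest cuda, cudnn and tensorflow versions from the same point in time. The three versions that built successfully above were all downloaded from their official sites on the same day.

2. Keep backups, keep learning and verifying, get something running first, and tinker less.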
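In the spirit of keeping backups, one lightweight habit is to write down the combination that worked and compare any new setup against it before starting a long build. The sketch below is only an illustration: the version strings come from this post, while the dictionary layout and the function name differences are assumptions made for the example.

```python
# The combination reported above as building successfully.
KNOWN_GOOD = {
    "cuda": "9.0.176",
    "cudnn": "7",
    "tensorflow": "1.4.0rc1",
}

def differences(current, known_good=KNOWN_GOOD):
    """Return {component: (current, known_good)} for every component that differs."""
    return {
        name: (current.get(name), wanted)
        for name, wanted in known_good.items()
        if current.get(name) != wanted
    }

# The first attempt in this post used cuDNN 5.1 with CUDA 9.0.
first_attempt = {"cuda": "9.0.176", "cudnn": "5.1", "tensorflow": "1.4.0rc1"}
print(differences(first_attempt))  # {'cudnn': ('5.1', '7')}
```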
