Understanding Deep Learning GPU Cards (2)
Source: Internet · Editor: 程序博客网 · Date: 2024/05/21 09:14
Continuing the repost series: this is the second article.
《How To Build and Use a Multi GPU System for Deep Learning》
When I started using GPUs for deep learning, my deep learning skills improved quickly. When you can run experiments with different algorithms and different parameters and gain rapid feedback, you just learn much more quickly. At the beginning, deep learning is a lot of trial and error: you have to get a feel for which parameters need to be adjusted, or which puzzle piece is missing in order to get a good result. A GPU helps you fail quickly and learn important lessons so that you can keep improving. Soon my deep learning skills were sufficient to take 2nd place in the Crowdflower competition, where the task was to predict weather labels from given tweets (sunny, raining, etc.).
After this success I was tempted to use multiple GPUs to train deep learning algorithms even faster. I also took an interest in training very large models that do not fit into a single GPU. I thus wanted to build a small GPU cluster and explore the possibilities of speeding up deep learning with multiple nodes, each with multiple GPUs. At the same time I was offered contract work as a database developer through my old employer. This gave me the opportunity to earn the money for the GPU cluster I had in mind.
Important components in a GPU cluster
When I did my research on which hardware to buy, I soon realized that the main bottleneck would be the network bandwidth, i.e. how much data can be transferred from computer to computer per second. The bandwidth of affordable network cards (about 4 GB/s) does not come close to PCIe 3.0 bandwidth (15.75 GB/s). So GPU-to-GPU communication within a computer will be fast, but between computers it will be slow. On top of that, most network cards only work with memory that is registered with the CPU, so a GPU-to-GPU transfer between two nodes goes like this: GPU 1 to CPU 1 to Network Card 1 to Network Card 2 to CPU 2 to GPU 2. This means that if one chooses a slow network card there might be no speedup over a single computer. Even with fast network cards, if the cluster is large, one may not even get a speedup from GPUs compared to CPUs, because the GPUs work too fast for the network cards to keep up.
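A back-of-the-envelope calculation shows why the network link dominates. The bandwidth figures are the ones quoted above; the 1-billion-parameter model is a hypothetical example, not from the original article:

```python
# Rough transfer-time estimate for synchronizing one full set of
# fp32 gradients over a link. Bandwidths are from the text above;
# the 1B-parameter model size is a made-up illustration.

def transfer_seconds(num_params, bytes_per_param, bandwidth_gb_per_s):
    """Time in seconds to push one full gradient copy over a link."""
    total_bytes = num_params * bytes_per_param
    return total_bytes / (bandwidth_gb_per_s * 1e9)

params = 1_000_000_000          # hypothetical 1B-parameter model
fp32 = 4                        # bytes per float32 value

pcie3 = transfer_seconds(params, fp32, 15.75)  # PCIe 3.0, ~15.75 GB/s
nic   = transfer_seconds(params, fp32, 4.0)    # affordable NIC, ~4 GB/s

print(f"PCIe 3.0: {pcie3:.2f} s per sync")  # ~0.25 s
print(f"Network : {nic:.2f} s per sync")    # 1.00 s
```

So every synchronization step over the network costs roughly 4x what it costs within a machine, before counting the extra copies of the CPU-staged path (GPU to CPU to NIC to NIC to CPU to GPU).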
This is the reason why many big companies like Google and Microsoft are using CPU rather than GPU clusters to train their big neural networks. Luckily, Mellanox and Nvidia recently came together to work on that problem and the result is GPUDirect RDMA, a network card driver that can make sense of GPU memory addresses and thus can transfer data directly from GPU to GPU between computers.
NVIDIA GPUDirect RDMA can bypass the CPU for inter-node communication – data is transferred directly from GPU to GPU.
Generally your best bet for cheap network cards is eBay. I won an auction for a set of 40 Gbit/s Mellanox network cards that support GPUDirect RDMA, along with the fitting fibre cables. I already had two GTX Titan GPUs with 6GB of memory each, and since I wanted to build huge models that do not fit into a single GPU's memory, I decided to keep the 6GB cards and buy more of them to build a cluster featuring 24GB of combined memory. In retrospect this was a rather foolish (and expensive) idea, but little did I know about the performance of such large models and how to evaluate the performance of GPUs. All the lessons I learned from this can be found here. Besides that, the hardware is rather straightforward. For fast inter-node communication PCIe 3.0 is faster than PCIe 2.0, so I got a PCIe 3.0 board. It is also a good idea to have about twice as much RAM as you have GPU memory, so you can work more freely with big nets. As deep learning programs mostly use a single thread per GPU, a CPU with as many cores as you have GPUs is often sufficient.
Hardware: Check. Software: ?
There are basically two options for multi-GPU programming. You can do it in CUDA with a single thread and manage the GPUs directly, by setting the current device and by declaring and assigning a dedicated stream to each GPU; or you can use CUDA-aware MPI, where a single process is spawned for each GPU and all communication and synchronization is handled by MPI. The first method is rather complicated, as you need to create efficient abstractions to loop through the GPUs and handle streaming and computing. Even with efficient abstractions your code can quickly blow up in line count, making it less readable and maintainable.
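The control flow of that first option can be mocked up without any GPU at all. The sketch below is a structural stand-in only: `Stream` imitates a per-device CUDA stream (enqueue asynchronously, drain on synchronize), and the comments mark where the real `cudaSetDevice`/`cudaStreamCreate` calls would go; no actual CUDA is used:

```python
# Structural mock of the single-threaded multi-GPU pattern: one host
# thread loops over the devices, making each one current and issuing
# work on that device's dedicated stream. Stream is a stand-in for a
# CUDA stream, not real CUDA.

class Stream:
    def __init__(self):
        self.ops = []                        # queued async operations

    def launch(self, op):
        self.ops.append(op)                  # enqueue; returns immediately

    def synchronize(self):
        results = [op() for op in self.ops]  # drain the queue
        self.ops = []
        return results

num_gpus = 4
streams = [Stream() for _ in range(num_gpus)]          # one stream per "GPU"
chunks = [list(range(i * 10, i * 10 + 10)) for i in range(num_gpus)]

# A single thread loops over the devices; each launch is non-blocking.
for dev in range(num_gpus):
    # real code: cudaSetDevice(dev), then an async copy + kernel launch
    streams[dev].launch(lambda c=chunks[dev]: sum(c))

# Synchronize every device's stream before using the results.
partial = [streams[dev].synchronize()[0] for dev in range(num_gpus)]
print(partial)       # per-"GPU" partial sums: [45, 145, 245, 345]
print(sum(partial))  # combined result: 780
```

Even in this toy form you can see the pattern the text complains about: every new operation needs another loop over all devices, which is exactly how the line count explodes.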
Some sample MPI code. The first action spreads one chunk of data to every other computer in the network; the second action receives one chunk of data from every process. That is all you need to do; it is very easy!
The second option is much more efficient and clean. MPI is the standard in high performance computing, and its standardized library means you can be sure that an MPI method really does what it is supposed to do. Under the hood MPI uses the same principles as the first method described above, but the abstraction is so good that it is quite easy to adapt single-GPU code to multi-GPU code (at least for data parallelism). The result is clean, maintainable code, and as such I would always recommend using MPI for multi-GPU computing. MPI libraries come in many languages, so you can pair them with the language of your choice. With these two components you are ready to go and can immediately start programming deep learning algorithms for multiple GPUs.
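The scatter/gather pattern described in the caption above can be sketched without an MPI installation. In this stdlib-only simulation each list index stands in for one MPI rank, and the two helper functions mirror what `MPI_Scatter` and `MPI_Gather` do; real code would use an MPI binding such as mpi4py instead:

```python
# Pure-Python illustration of the two MPI collectives needed for
# data-parallel training: scatter hands each rank a chunk of the
# mini-batch, gather collects every rank's gradient back on the root.
# Each index simulates one MPI rank; no real MPI is involved.

def scatter(data, num_ranks):
    """Split data into num_ranks equal chunks, one per rank (like MPI_Scatter)."""
    n = len(data) // num_ranks
    return [data[i * n:(i + 1) * n] for i in range(num_ranks)]

def gather(per_rank_results):
    """Collect one result from every rank onto the root (like MPI_Gather)."""
    return list(per_rank_results)

num_ranks = 4
batch = list(range(16))                 # a toy "mini-batch" of 16 examples

chunks = scatter(batch, num_ranks)      # each rank gets 4 examples
# every rank computes a "gradient" on its own chunk (here: just the mean)
grads = [sum(c) / len(c) for c in chunks]
all_grads = gather(grads)

# the root averages the gradients -- the usual data-parallel update
avg_grad = sum(all_grads) / num_ranks
print(chunks)    # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
print(avg_grad)  # 7.5, identical to the mean over the whole batch
```

This is the entire communication structure of data parallelism: everything else in the training loop stays single-GPU code, which is why adapting it with MPI is so easy.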