CPU vs. GPU memory bandwidth comparison (CPU vs CUDA GPU memory bandwidth)

Source: Internet  Editor: 程序博客网  Time: 2024/06/05 18:27

Original link: http://blog.cudachess.org/2009/07/cpu-vs-cuda-gpu-memory-bandwidth/

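Introduction:

I have recently been planning to learn CUDA, but while chatting with a classmate she mentioned that GPUs are not suited to certain kinds of computation because the bottleneck is I/O. Yet when I looked at GPU specifications, the memory bandwidth is very high, so how can that be? The article below answers this question.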

How should we compare and interpret the memory bandwidth gap between a modern CPU and a GPU built on the CUDA architecture?

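From what I had worked out so far, I believed that although GPU memory bandwidth is huge, the CPU's L1 cache could in practice be more efficient than the current CUDA architecture.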

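A CUDA GPU delivers up to 10 times the gigaflops (billions of floating-point operations per second) of a current Core i7/Nehalem processor. To make full use of that compute power, we need to feed it data from memory (global video memory or the computer's main memory), and unload the results, as fast as possible.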

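Through an interesting article that benchmarked overclocked Core i7 cache and memory bandwidth, I found that with triple-channel DDR3, L1 cache reads or writes each peak around 50GB/s, and because both can proceed at once the combined peak reaches 100GB/s, while main memory (triple-channel DDR3) manages only 16GB/s. This is surprising: the L1 cache of a three-year-old Athlon X2 3800+ (2×2 GHz) is faster than today's newest main memory! (Translator's note: I suspect a typo in the original; the point should be amazement that the three-year-old part is faster than today's, not the reverse.)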

CUDA shared memory (16KB per 8 Scalar Processors) and the CPU's L1 cache (32KB) run at about the same speed, both around 50GB/s.

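The GPU's global memory bandwidth reaches 100GB/s to 150GB/s, nearly 8 times the computer's main memory bandwidth, thanks to more 64-bit interfaces (8 vs 3) and higher clock rates.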
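To make the "more 64-bit interfaces and higher clocks" reasoning concrete, here is a minimal sketch of the usual theoretical-peak calculation: interface count times interface width times transfer rate. The interface counts (3 and 8) and the 64-bit width come from the article; the transfer rates below (DDR3-1333 for the CPU, 2484 MT/s for the GPU) are assumptions chosen for illustration and are not stated in the article. The helper name `peak_bandwidth_gbs` is likewise hypothetical.

```python
def peak_bandwidth_gbs(interfaces: int, bits_per_interface: int,
                       transfers_per_sec: float) -> float:
    """Theoretical peak bandwidth in GB/s (1 GB = 1e9 bytes)."""
    bytes_per_transfer = interfaces * bits_per_interface / 8
    return bytes_per_transfer * transfers_per_sec / 1e9


# Interface counts are from the article; transfer rates are ASSUMED values.
cpu_peak = peak_bandwidth_gbs(3, 64, 1333e6)  # triple-channel DDR3 (assumed DDR3-1333)
gpu_peak = peak_bandwidth_gbs(8, 64, 2484e6)  # 8 x 64-bit interfaces (assumed 2484 MT/s)

print(f"CPU main memory, theoretical peak:   {cpu_peak:.1f} GB/s")
print(f"GPU global memory, theoretical peak: {gpu_peak:.1f} GB/s")
print(f"GPU / CPU ratio: {gpu_peak / cpu_peak:.1f}x")
```

Note that this gives theoretical peaks only. The article's 16GB/s main memory figure comes from a benchmark rather than from a peak calculation, so the ratio printed here will not match the article's "nearly 8 times".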

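Next, compare the read/write speed of the GPU's shared memory with that of the CPU's L1 cache. On an i7, each of the four cores has its own L1 cache, so the aggregate peak reaches 200GB/s to 400GB/s. A CUDA GTX 285 has 30 groups of 8 Scalar Processors, so its aggregate shared memory bandwidth can reach 1500GB/s, about 4 times that of an overclocked i7.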
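The arithmetic behind these aggregate figures can be sketched as follows. All numbers are the article's own (about 50GB/s per L1 cache for reads or writes, about 100GB/s when both overlap, about 50GB/s per shared memory block); the helper name `aggregate_gbs` is only for illustration, and the sketch assumes bandwidth scales linearly with the number of cores or processor groups.

```python
def aggregate_gbs(units: int, per_unit_gbs: float) -> float:
    """Aggregate bandwidth, assuming every unit runs at its peak in parallel."""
    return units * per_unit_gbs


# Core i7: 4 cores, each with its own L1 cache.
i7_low = aggregate_gbs(4, 50.0)    # reads only or writes only
i7_high = aggregate_gbs(4, 100.0)  # reads and writes overlapping

# GTX 285: 30 groups of 8 Scalar Processors, one shared memory per group.
gtx285_shared = aggregate_gbs(30, 50.0)

print(f"Core i7 aggregate L1:     {i7_low:.0f} to {i7_high:.0f} GB/s")
print(f"GTX 285 aggregate shared: {gtx285_shared:.0f} GB/s")
print(f"Ratio vs. i7 best case:   {gtx285_shared / i7_high:.2f}x")
```

The "about 4 times" figure corresponds to comparing against the i7's best case of 400GB/s; against the 200GB/s case the ratio would be larger.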

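To summarize, a CUDA GPU's global memory is up to 8 times as fast as the computer's main memory, and its shared memory is up to 4 times as fast as a modern CPU's L1 cache.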

The original text follows:

What is the memory bandwidth of a modern CPU versus that of a CUDA-enabled GPU?

As far as I had figured out, GPU memory bandwidth is huge, but I thought that the memory bandwidth of a CPU's L1 cache could effectively be better than that of the current CUDA architecture.

With all the horsepower delivered by a CUDA GPU, up to 10X the gigaflops of current Core i7/Nehalem processors on a GTX, we need to be able to feed it with data and unload results as fast as possible from and to memory (global video card memory or the computer's main memory).

I found an interesting article that benchmarked overclocked Core i7 cache and memory bandwidth, in triple-channel mode with fast DDR3: L1 cache peaks around 50GB/s reading or writing but can do both at once, peaking at 100GB/s, while the computer's main memory (triple-channel DDR3) is limited to 16GB/s. That's astonishing in itself: the L1 cache of a 3-year-old Athlon X2 3800+ (2×2 GHz) delivers no more than today's main memory!

The counterpart of a CPU's L1 cache (32KB) is CUDA Shared Memory (16KB per 8 Scalar Processors), and it delivers around 50GB/s too, a strangely similar value.

The counterpart of the computer's main memory is Global Memory, which delivers between 100GB/s and 150GB/s, nearly 8X the computer's main memory bandwidth, due to more 64-bit interfaces (8 instead of 3) and higher clock rates.

But when you test shared memory or L1 cache access speed, you have to remember that there are 4 cores on a Core i7, each with its own dedicated L1 cache, so the aggregate peaks at 200GB/s to 400GB/s depending on the task.

On the other hand, with 30 groups of 8 Scalar Processors, the Shared Memory of a CUDA GTX 285 may deliver 1500GB/s, around 4X the aggregate L1 cache bandwidth of an overclocked Core i7!

To sum up, a CUDA-enabled GPU offers up to 8X the speed of main memory and up to 4X the speed of L1 cache compared to a modern CPU, and it shows!