CPU vs CUDA GPU memory bandwidth
Source: Internet  Editor: 程序博客网  Time: 2024/06/05 18:27
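Original article: http://blog.cudachess.org/2009/07/cpu-vs-cuda-gpu-memory-bandwidth/
Introduction:
I have recently been planning to learn CUDA, but a classmate mentioned in conversation that GPUs are a poor fit for some kinds of computation because the bottleneck is I/O. Yet the GPU specifications I looked at list very high memory bandwidth, so how can that be? The article below answers this question.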
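How does the memory bandwidth of a modern CPU compare with that of a CUDA-enabled GPU?
From what I had worked out so far, GPU memory bandwidth is huge, but I suspected that the bandwidth of a CPU's L1 cache could in practice be better than that of the current CUDA architecture.
A CUDA GPU such as the GTX delivers up to 10 times the gigaflops (billions of floating-point operations per second) of current Core i7/Nehalem processors. To make use of that compute power, we need to feed it data and unload results as fast as possible through memory (the card's global memory or the computer's main memory).
Through an interesting article that benchmarked overclocked Core i7 cache and memory bandwidth, I found that with fast triple-channel DDR3 the L1 cache peaks at around 50GB/s for reads or for writes, and since both can proceed at once the combined peak reaches 100GB/s, while main memory (triple-channel DDR3) is limited to 16GB/s. That is astonishing: the L1 cache of a three-year-old Athlon X2 3800+ (2×2GHz) delivers no more bandwidth than today's main memory! (Translator's note: I suspect a slip in the original here; the surprise would more naturally be that the three-year-old part is faster than today's, not the reverse.)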
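Figures like the 16GB/s quoted above come from timing bulk transfers. The following is only a rough sketch of that idea in Python, not the benchmark the cited article used: the function name `copy_bandwidth_gbs`, the 64 MiB buffer size, and the repeat count are all assumptions chosen for illustration. It reports bytes copied per second, so it will not match a dedicated tool that counts reads and writes separately.

```python
import time


def copy_bandwidth_gbs(size_bytes=64 * 1024 * 1024, repeats=5):
    """Estimate host memory copy bandwidth in GB/s (hypothetical helper).

    The buffer should be much larger than the CPU caches, so the copy
    mostly exercises main memory rather than L1/L2.
    """
    src = bytearray(size_bytes)
    dst = bytearray(size_bytes)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        dst[:] = src  # same-size slice assignment is a bulk in-memory copy
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)  # keep the fastest run to reduce noise
    return size_bytes / best / 1e9


if __name__ == "__main__":
    print(f"about {copy_bandwidth_gbs():.1f} GB/s copied")
```

Shrinking `size_bytes` until the buffers fit in cache should raise the reported number, which is the cache-versus-main-memory gap the article describes.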
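CUDA shared memory (16KB per 8 scalar processors) is the counterpart of the CPU's L1 cache (32KB), and it runs at about the same speed, around 50GB/s.
The GPU's global memory, the counterpart of the computer's main memory, delivers between 100GB/s and 150GB/s, nearly 8 times the main memory bandwidth, thanks to more 64-bit interfaces (8 instead of 3) and higher clock rates.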
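The "more interfaces, higher clocks" argument can be made concrete with the usual peak-bandwidth formula (bus width in bytes times transfers per second). The sketch below is illustrative only: the transfer rates (1333 MT/s for the DDR3 system, 2484 MT/s for the GTX 285) are assumptions that do not appear in the article, and `peak_bandwidth_gbs` is a hypothetical helper. The measured figures (16, 100 and 150GB/s) are the ones the article quotes.

```python
def peak_bandwidth_gbs(channels, bits_per_channel, megatransfers_per_s):
    """Theoretical peak bandwidth in GB/s for a memory interface."""
    bytes_per_transfer = channels * bits_per_channel // 8
    return bytes_per_transfer * megatransfers_per_s * 1e6 / 1e9


# Assumed transfer rates, for illustration only (not from the article).
cpu_peak = peak_bandwidth_gbs(channels=3, bits_per_channel=64, megatransfers_per_s=1333)
gpu_peak = peak_bandwidth_gbs(channels=8, bits_per_channel=64, megatransfers_per_s=2484)

# Figures quoted in the article (GB/s).
CPU_MAIN_MEMORY_GBS = 16
GPU_GLOBAL_LOW_GBS, GPU_GLOBAL_HIGH_GBS = 100, 150

measured_ratio_low = GPU_GLOBAL_LOW_GBS / CPU_MAIN_MEMORY_GBS    # 6.25
measured_ratio_high = GPU_GLOBAL_HIGH_GBS / CPU_MAIN_MEMORY_GBS  # 9.375

print(f"theoretical peaks: CPU {cpu_peak:.1f} GB/s, GPU {gpu_peak:.1f} GB/s")
print(f"quoted ratio: {measured_ratio_low} to {measured_ratio_high}")
```

The theoretical peaks are upper bounds. The article's "nearly 8 times" comes from the quoted figures, whose ratio falls between 6.25 and 9.375.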
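Next, compare shared memory speed on the GPU with L1 cache speed on the CPU. On the i7, each of the four cores has its own L1 cache, so the aggregate peak reaches 200 to 400GB/s. The CUDA GTX 285 has 30 groups of 8 scalar processors, so its aggregate shared memory bandwidth can reach 1500GB/s, about 4 times that of an overclocked i7.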
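The aggregate figures follow from multiplying the per-unit bandwidth by the number of units. A minimal sketch of that arithmetic, using only numbers quoted in the article (the variable names are mine):

```python
# Per-unit figures quoted in the article (GB/s).
CPU_CORES = 4
L1_READ_OR_WRITE_GBS = 50      # one direction at a time
L1_READ_PLUS_WRITE_GBS = 100   # reads and writes overlapped

GPU_MULTIPROCESSORS = 30       # groups of 8 scalar processors on a GTX 285
SHARED_PER_MULTIPROCESSOR_GBS = 50

cpu_l1_low = CPU_CORES * L1_READ_OR_WRITE_GBS      # 200
cpu_l1_high = CPU_CORES * L1_READ_PLUS_WRITE_GBS   # 400
gpu_shared_total = GPU_MULTIPROCESSORS * SHARED_PER_MULTIPROCESSOR_GBS  # 1500

# Ratio against the best-case aggregate L1 figure.
ratio = gpu_shared_total / cpu_l1_high  # 3.75, i.e. "around 4X"

print(f"CPU aggregate L1: {cpu_l1_low} to {cpu_l1_high} GB/s")
print(f"GPU aggregate shared memory: {gpu_shared_total} GB/s ({ratio}x)")
```

These are peak numbers that assume every core or multiprocessor streams from its own L1 cache or shared memory at once, so real workloads would typically see less.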
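To sum up, a CUDA GPU's global memory is up to 8 times as fast as the computer's main memory, and its shared memory is up to 4 times as fast as a modern CPU's L1 cache.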
The original text follows:
What is the memory bandwidth of a modern CPU versus that of a CUDA-enabled GPU?
As far as I had figured it out, GPU memory bandwidth was huge, but I thought the memory bandwidth of a CPU's L1 cache could be effectively better than that of the current CUDA architecture.
With all the horsepower delivered by a CUDA GPU, up to 10X the gigaflops on a GTX compared with current Core i7/Nehalem processors, we need to be able to feed it with data and unload results as fast as possible through memory (global video card memory or the computer's main memory).
I found an interesting article that benchmarked overclocked Core i7 cache and memory bandwidth, in triple-channel with fast DDR3: L1 cache peaks around 50GB/s reading or writing but can do both at once, peaking at 100GB/s, while main computer memory (triple-channel DDR3) was limited to 16GB/s. That's actually astonishing: the L1 cache of a 3-year-old Athlon X2 3800+ (2×2GHz) doesn't deliver more than today's main memory!
The GPU counterpart of a CPU's L1 cache (32KB) is CUDA Shared Memory (16KB per 8 Scalar Processors), and it delivers around 50GB/s too, a strangely similar value.
The counterpart of the computer's main memory is Global Memory, which delivers between 100GB/s and 150GB/s, nearly 8X the computer's main memory bandwidth, due to multiple 64-bit interfaces (8 instead of 3) and higher clock rates.
But when you test shared memory or L1-cache access speed, remember that there are 4 cores on a Core i7, each with its own dedicated L1 cache, peaking at 200GB/s to 400GB/s in aggregate depending on the task.
On the other side, with 30 groups of 8 Scalar Processors, the Shared Memory of a CUDA GTX 285 may deliver 1500GB/s, around 4X the aggregated L1 cache of an overclocked Core i7!
To sum up, a CUDA-enabled GPU offers up to 8X the speed of main memory and 4X the speed of L1 cache compared with a modern CPU, and it shows!