性能测试

来源:互联网 发布:微信淘宝客返利机器人 编辑:程序博客网 时间:2024/06/05 16:51
这是2007年英特尔多核平台编程优化大赛题,该文的作者拿到了最佳优化奖,但此处的代码并非最优化的,只是改进了乘方、自己实现随机数而已。(最优版本可参见蒋黎的代码和侯思松的代码,感觉很变态…)
其中,C++和C++ CLR的代码相同,只是采用的编译指令不同而已。
此外,C#代码从二维数组改为一维数组,我稍微测试了一下,C#二维数组和嵌套数组的速度确实很慢。原作者的代码似乎还把下标弄错了,所以也改了下。
而Python、Cython、JavaScript和Lua的代码是我自己照英特尔的原程序改编的,没有对随机数进行优化,附在文末。
Squirrel的代码由dwing提供,我只是将table改成了数组,RAND_MAX + 1.0改成了32768.0的常数,时间快了约5秒左右。

测试平台:
CPU:Intel Core2 Duo T9400 @ 2.53GHz
内存:3 GB
操作系统:Windows XP Pro SP2英文版
C++编译器:gcc version 4.4.1 (TDM-2 mingw32)(以下简称G++)、Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 15.00.21022.08 for 80x86(以下简称VC++ 2008)、VC++ 2010 Beta2
C++ CLR编译器:VC++ 2008、VC++ 2010 Beta2
C#编译器:Microsoft (R) Visual C# 2008 Compiler version 3.5.21022.8、Microsoft (R) Visual C# 2010 Compiler version 4.0.21006.1
Java:Java(TM) SE Runtime Environment (build 1.6.0_13-b03)
Python:Python 2.5.4 (r254:67916, Dec 23 2008, 15:10:54) [MSC v.1310 32 bit (Intel)] on win32、Python 2.6.4 (r264:75708, Oct 26 2009, 08:23:19) [MSC v.1500 32 bit (Intel)] on win32、Python 3.1.1 (r311:74483, Aug 17 2009, 17:02:12) [MSC v.1500 32 bit (Intel)] on win32
Cython:0.11.3
Psyco:1.6
IronPython:IronPython 2.6 (2.6.10920.0) on .NET 2.0.50727.1433
JavaScript:ChromePlus 1.2.6.0 (Chromium    4.0.222.3 (Official Build 28644),V8 1.3.15)
Lua:5.1.4
LuaJIT:1.1.5、2.0.0-beta1
Ruby:ruby 1.9.1p129 (2009-05-12 revision 23412) [i386-mswin32]
Squirrel:Squirrel 3.0 alpha 2 Copyright (C) 2003-2009 Alberto Demichelis (32 bits)
顺带一提,G++是采用TDM版(目前尚无GCC 4.4.1和4.4.2的Windows版),VC++ 2008编译器是Microsoft Visual C++ Toolkit 2008里的,C#编译器是Microsoft .NET Framework 3.5和4.0 Beta2里的,Cython和LuaJIT用的C编译器是GCC 4.4.1,LuaJIT 2.0.0还测试了VC++ 2003编译的版本,Ruby没有采用更慢的稳定版。

测试时关掉了大部分占用资源的程序,每种搭配测试5次,取时间最少的一次作为最终结果。
从控制台运行时似乎会影响速度(此时是C++ CLR以超过第2名3%的优势胜出),所以对于exe文件,我改成了双击运行和控制台各测5次,取最小值。

测试结果:

语言/编译器

编译参数

运行时间(秒)

相对速度

C++ (G++ 4.4.1)

-O3 -s

1.765

85.84%

C++ (G++ 4.4.1)

-O3 -s -mtune=native -march=native

1.859

81.50%

C++ (G++ 4.4.1)

-O3 -s -mfpmath=sse -mtune=native -march=native

1.578

96.01%

C++ (G++ 4.4.1)

-O3 -s -mfpmath=sse -mtune=native -march=native -ffast-math

1.546

97.99%

C++ (VC++ 2008)

/O2 /MD

1.625

93.23%

C++ (VC++ 2008)

/O2 /MD /arch:SSE2

1.593

95.10%

C++ (VC++ 2008)

/O2 /MD /arch:SSE2 /fp:fast

1.562

96.99%

C++ (VC++ 2010 Beta2)

/O2 /MD

1.64

92.38%

C++ (VC++ 2010 Beta2)

/O2 /MD /arch:SSE2

1.562

96.99%

C++ (VC++ 2010 Beta2)

/O2 /MD /arch:SSE2 /fp:fast

1.531

98.95%

C++ CLR (VC++ 2008)

/O2 /MD /clr

1.578

96.01%

C++ CLR (VC++ 2010 Beta2)

/O2 /MD /clr

1.515

100.00%

C# 3.5

无参数

3.281

46.17%

C# 3.5

/o

3.219

47.06%

C# 4.0 Beta2

无参数

2.203

68.77%

C# 4.0 Beta2

/o

2.187

69.27%

Java 1.6

无参数

2.093

72.38%

Python 2.5.4

无参数

117.163

1.29%

Python 2.6.4

无参数

91.137

1.66%

Python 3.1.1

无参数

109.531

1.38%

Cython 0.11.3

-O3

11.795

12.84%

Psyco 1.6 (bind)

无参数

26.206

5.78%

Psyco 1.6 (full)

无参数

25.650

5.91%

IronPython 2.6

无参数

59.630

2.54%

JavaScript (V8 1.3.15)

无参数

16.653

9.10%

Lua 5.1.4

无参数

48.969

3.09%

Lua 5.1.4 (Luac编译)

-s

49.031

3.09%

LuaJIT 1.1.5

无参数

24.141

6.28%

LuaJIT 1.1.5

运行:-O2

9.641

15.71%

LuaJIT 1.1.5 (Luac编译)

编译:-s

24.093

6.29%

LuaJIT 1.1.5 (Luac编译)

编译:-s 运行:-O2

9.641

15.71%

LuaJIT 2.0.0-beta1 (GCC 4.4.1编译)

无参数

1.546

97.99%

LuaJIT 2.0.0-beta1 (GCC 4.4.1编译)

运行:-O2

4.578

33.09%

LuaJIT 2.0.0-beta1 (GCC 4.4.1编译)

无参数

1.625

93.23%

LuaJIT 2.0.0-beta1 (VC++ 2003编译)

运行:-O2

4.578

33.09%

Ruby 1.9.1

无参数

327.797

0.46%

Squirrel 3.0 alpha 2

无参数

73.873

2.05%

Squirrel 3.0 alpha 2 (编译成字节码)

无参数

77.828

1.95%

冠军让C++ CLR夺走了,这点出乎我的意料。测试了多次,均比其他C++实现的时间要少,可见不属于测试误差。看来相对于原生代码,CLR会自动对当前平台进行优化。缺点就是体积稍大了点(29.5KB(VC++ 2010)/24KB(VC++ 2008),而其他的C++实现为7KB左右,VC++ 2008编译的还会生成1KB manifest文件),且需要安装.net framework才能运行。没有测试Intel编译器是一大遗憾,估计它才是性能王者。(由于运行库不兼容,我的电脑上无法测试;在别人的电脑上,Intel C++ 11编译器速度约为VC++2010 CLR的2倍,非常震惊。)
G++在最佳的参数下和VC++基本持平,不过单一的-O3参数并没有什么优势,要组合出那么多种参数也蛮难为人的,而且-ffast-math也是不符合IEEE规范的浮点实现。顺带一提,用该参数编译未优化版本的结果是1.578秒,和优化后几乎没有区别。
VC++ 2008各种参数的性能都差不多,所以不用过多考虑搭配。
VC++ 2010 Beta2略微超过了VC++ 2008和G++,不过需要附带一个新的动态链接库(msvcr100.dll,749 KB)。
让我无语的是C# 3.5居然远远落后于Java(慢了约53.8%),而且是在Windows操作系统上,也未测试可能更快的Java 7及IBM、DEA的JVM和GCJ。C# 4.0的性能提高了不少,只略低于Java 6了。(顺带一提,C# 2.0和3.5成绩几乎相同,看来微软在4.0上花了大气力来提升性能。)此外,C#生成的代码只有5KB,且无manifest文件。(但别忘了运行库的体积。)
而Python则仅快于Ruby,比静态编译的实现慢了约2个数量级,这也在我意料之中。而在性能方面,2.6.4 > 3.1.1 > 2.5.4,而3.x的表现还不错。顺带一提动态语言的多维数组都比较慢,所以改成了一维。
Cython目前仍无针对数组的优化(详见《CEP 517: Cython array type》),而本程序大量涉及对数组的操作,所以性能提升相当有限。此外,将列表设为了1维,速度比2维快了约50%。同时,由于Cython不能对乘方进行优化,因此我改成了乘法。
Python+Psyco的组合还算不错,只增加了2行,性能就提升了约5倍,凌驾于其他动态语言之上。同上,我将乘方改成了乘法,开方则调用系统函数。
IronPython的性能让我很惊讶,因为Jython的性能据说很糟糕,没想到在.net平台上会发挥得这么好。
JavaScript只测试了性能最好的V8引擎。由于输出是调用DOM API,较为影响速度,所以我直接删除了。优化同上,数组采用1维。
在接受dwing的建议,将全局变量和系统函数声明为local后,Lua的速度提升了1倍,表现很不错。不过Luac编译似乎没有帮助,以后不去用了。此外,同样也将table优化成了1维数组。
LuaJIT 1.1.5表现非常不错,改成local声明后,速度提升了3倍,在-O2开启时居然比Cython更快。速度大约是C++的1/6,Java的1/5,相差已经在一个数量级之内了。
而最近刚出的LuaJIT 2.0.0-beta1更是达到了C++的水平,这对动态语言来说简直是不可思议的。不过大部分测试是稳定在1.625秒,只有一次达到1.546。而VC++ 2003编译出来的是在1.625~1.640之间波动,但体积略小。至于兼容性,采用-O0~-O2来运行都会有大幅的性能下降(非常诡异),而且现在不支持luac编译的代码。此外,文档里也提到了,LuaJIT 2在数值计算方面接近于静态编译语言,而其他方面(例如字符串)仍略逊于后者,整体性能和JVM 1.6 Client持平。
Ruby是很久没碰了,于是测试了半天,选了种最快的实现,但仍比Python 2.6.4慢了59%,速度仅有C++的1/200。代码里我用while取代了for,因为后者(无论0...n还是0.upto(n))比前者慢了一半。(Python里正好相反,于是人们会倾向于用最优雅的方式得到最好的性能。)此外数组采用了1维,乘方改成了乘法,开方用系统函数。
Squirrel表现平平,和Lua相比完全出于劣势,而且我也不喜欢它的语法。顺带一提,第一次运行时为73秒,之后就一直是78秒了,编译成字节码也只能到77秒,这也是测出非编译版本更快的原因。

Python和IronPython:

from random import random

from time import clock

 

NPARTS = 1000

NPARTS2 = NPARTS * 2

NITER = 201

DIMS = 3

pot = .0

SIZE = NPARTS * DIMS

r = [.0]  * SIZE

 

def computePot():

  global NPARTS, DIMS, r, pot

  for i in xrange(NPARTS):

    for j in xrange(i - 1):

      distx = r[j] - r[i]

      disty = r[NPARTS + j] - r[NPARTS + i]

      distz = r[NPARTS2 + j] - r[NPARTS2 + i]

      dist = (distx * distx + disty * disty + distz * distz) ** .5

      pot += 1.0 / dist

 

def initPositions():

  global NPARTS, DIMS, r

  for i in xrange(SIZE):

    r[i] = 0.5 + random()

 

def updatePositions():

  global NPARTS, DIMS, r

  for i in xrange(SIZE):

    r[i] -= 0.5 + random()

 

def main():

  global pot, NITER

  initPositions()

  updatePositions()

 

  start = clock()

  for i in xrange(NITER):

    pot = .0

    computePot()

    if not i % 10:

      print "%5d: Potential: %10.7f" % (i, pot)

    updatePositions()

  stop = clock()

  print "Seconds = %10.9f" % (stop - start)

 

if __name__ == '__main__':

  main()

Python 3000:将Python的代码中,print改成函数调用形式(加括号),xrange改成range即可。

Cython:

from random import random

from time import clock

 

cdef int NPARTS = 1000

cdef int NPARTS2 = NPARTS * 2

cdef int NITER = 201

cdef int DIMS = 3

cdef int SIZE = NPARTS * DIMS

cdef double pot = .0

cdef list r = [.0]  * SIZE

 

cpdef computePot():

  global NPARTS, DIMS, r, pot

  cdef int i, j

  cdef double distx, disty, distz, dist

  for i in xrange(NPARTS):

    for j in xrange(i - 1):

        distx = r[j] - r[i]

        disty = r[NPARTS + j] - r[NPARTS + i]

        distz = r[NPARTS2 + j] - r[NPARTS2 + i]

        dist = (distx * distx + disty * disty + distz * distz) ** .5

        pot += 1.0 / dist

 

cpdef initPositions():

  global NPARTS, DIMS, r

  cdef int i, j

  for i in xrange(SIZE):

    r[i] = 0.5 + random()

 

cpdef updatePositions():

  global NPARTS, DIMS, r

  cdef int i, j

  for i in xrange(SIZE):

    r[i] -= 0.5 + random()

 

def main():

  global pot, NITER

  cdef int i

  cdef double start, stop

  initPositions()

  updatePositions()

 

  start = clock()

  for i in xrange(NITER):

    pot = .0

    computePot()

    if not i % 10:

      print "%5d: Potential: %10.7f" % (i, pot)

    updatePositions()

  stop = clock()

 

  print "Seconds = %10.9f" % (stop - start)

Psyco(使用bind()):

from random import random

from time import clock

from math import sqrt

from psyco import bind

 

 

NPARTS = 1000

NITER = 201

DIMS = 3

pot = .0

r = [[.0]  * NPARTS] * DIMS

 

def computePot():

  global NPARTS, DIMS, r, pot

  for i in xrange(NPARTS):

    for j in xrange(i - 1):

      distx = (r[0][j] - r[0][i])

      distx *= distx

      disty = (r[1][j] - r[1][i])

      disty *= disty

      distz = (r[2][j] - r[2][i])

      distz *= distz

      dist = sqrt(distx + disty + distz)

      pot += 1.0 / dist

 

def initPositions():

  global NPARTS, DIMS, r

  for i in xrange(DIMS):

    for j in xrange(NPARTS):

      r[i][j] = 0.5 + random()

 

def updatePositions():

  global NPARTS, DIMS, r

  for i in xrange(DIMS):

    for j in xrange(NPARTS):

      r[i][j] -= 0.5 + random()

 

def main():

  global pot, NITER

  initPositions()

  updatePositions()

 

  start = clock()

  for i in xrange(NITER):

    pot = .0

    computePot()

    if not i % 10:

      print "%5d: Potential: %10.7f" % (i, pot)

    updatePositions()

  stop = clock()

  print "Seconds = %10.9f" % (stop - start)

 

bind(computePot)

bind(initPositions)

bind(updatePositions)

bind(main)

 

if __name__ == '__main__':

  main()

Psyco(使用full()):

from random import random

from time import clock

from math import sqrt

from psyco import full

 

 

full()

 

NPARTS = 1000

NITER = 201

DIMS = 3

pot = .0

r = [[.0]  * NPARTS] * DIMS

 

def computePot():

  global NPARTS, DIMS, r, pot

  for i in xrange(NPARTS):

    for j in xrange(i - 1):

      distx = (r[0][j] - r[0][i])

      distx *= distx

      disty = (r[1][j] - r[1][i])

      disty *= disty

      distz = (r[2][j] - r[2][i])

      distz *= distz

      dist = sqrt(distx + disty + distz)

      pot += 1.0 / dist

 

def initPositions():

  global NPARTS, DIMS, r

  for i in xrange(DIMS):

    for j in xrange(NPARTS):

      r[i][j] = 0.5 + random()

 

def updatePositions():

  global NPARTS, DIMS, r

  for i in xrange(DIMS):

    for j in xrange(NPARTS):

      r[i][j] -= 0.5 + random()

 

def main():

  global pot, NITER

  initPositions()

  updatePositions()

 

  start = clock()

  for i in xrange(NITER):

    pot = .0

    computePot()

    if not i % 10:

      print "%5d: Potential: %10.7f" % (i, pot)

    updatePositions()

  stop = clock()

  print "Seconds = %10.9f" % (stop - start)

 

if __name__ == '__main__':

  main()

我顺便还写了个numpy的实现:

from numpy import add, square

from numpy.random import rand

from time import clock

 

NPARTS = 1000

NITER = 201

DIMS = 3

r = rand(NPARTS, DIMS)

pot = .0

 

def computePot():

  global NPARTS, DIMS, r, pot

  for i in xrange(NPARTS):

    for j in xrange(i - 1):

        pot += 1.0 / (add.reduce(square(r[j] - r[i])) ** .5)

 

def initPositions():

  global r

  r += 0.5

 

def updatePositions():

  c = clock()

  print c

  global DIMS, NPARTS, r

  r -= (rand(NPARTS, DIMS) + .5)

  print clock() -c

 

def main():

  global pot, NITER

  initPositions()

  updatePositions()

 

  start = clock()

  for i in xrange(NITER):

    pot = .0

    computePot()

    if not i % 10:

      print "%5d: Potential: %10.7f" % (i, pot)

    updatePositions()

  stop = clock()

  print "Seconds = %10.9f" % (stop - start)

 

if __name__ == '__main__':

  main()

然而让我无语的是,numpy版比纯Python版还慢1个数量级。
稍微测试了一下,发现了原因:numpy的函数调用开销很大,因此不适合对小数组进行操作。
简单来说,计算二维数组里2列间差的平方和时,numpy调用了3次函数,而纯Python则完全是内置操作符,所以调用开销较小。而这个二维数组每列只有3个数,所以即便numpy的函数计算很快,但总体来说仍慢于纯Python版。
不过对付大数组则是numpy占据优势了:在数组长度为3时,numpy比Python慢10倍;长度为33时,2者持平;长度为330时,numpy快8倍;长度为3300时,numpy快30倍。(纯Python版的时间复杂度基本可视为O(N)。)

JavaScript:

var NPARTS = 1000;

var NPARTS2 = 2000;

var NITER = 201;

var DIMS = 3;

var SIZE = DIMS * NPARTS;

 

var r = new Array(SIZE);

var pot;

 

function initPositions() {

   for(var i = 0; i < SIZE; i++)

     r[i] = 0.5 + Math.random();

}

 

function updatePositions() {

   for (var i = 0; i < SIZE; i++)

      r[i] -= 0.5 + Math.random();

}

 

function computePot() {

   for (var i = 0; i < NPARTS; i++ ) {

      for (var j = 0, k = i - 1; j < k; j++ ) {

        var distx = r[j] - r[i];

        var disty = r[NPARTS + j] - r[NPARTS + i];

        var distz = r[NPARTS2 + j] - r[NPARTS2 + i];

        var dist = Math.sqrt(distx * distx + disty * disty + distz * distz);

        pot += 1.0 / dist;

      }

   }

}

 

initPositions();

updatePositions();

 

var start = new Date();

for (var i = 0; i < NITER; i++) {

  pot = 0.0;

  computePot();

  updatePositions();

}

var stop = new Date();

 

alert(stop - start);

Lua:

local NPARTS = 1000

local NITER = 201

local DIMS = 3

local NPARTS2 = 2000

local SIZE = DIMS * NPARTS

local pot = .0

local r = {}

local sqrt = math.sqrt

local random = math.random

 

 

function computePot()

  for i = 1, NPARTS do

    for j = 1, i - 1 do

      local distx = r[j] - r[i]

      local disty = r[NPARTS + j] - r[NPARTS + i]

      local distz = r[NPARTS2 + j] - r[NPARTS2 + i]

      pot = pot + 1.0 / sqrt(distx * distx + disty * disty + distz * distz)

    end

  end

end

 

function initPositions()

  for i = 1, SIZE do

    r[i] = 0.5 + random()

  end

end

 

function updatePositions()

  for i = 1, SIZE do

    r[i] = r[i] - (0.5 + random())

  end

end

 

initPositions()

updatePositions()

 

local start = os.clock()

for i = 1, NITER do

  pot = .0

  computePot()

  if i % 10 == 0 then

    print(string.format("%5d: Potential: %10.7f", i, pot))

  end

  updatePositions()

end

local stop = os.clock()

print(string.format("Seconds = %10.9f", stop - start))

Ruby:

$NPARTS = 1000

$NPARTS2 = 2000

$NITER = 201

$DIMS = 3

$SIZE = $DIMS * $NPARTS

$pot = 0.0

$r = [0.0]  * ($NPARTS * $DIMS)

 

def computePot()

  i = 0

  while i < $NPARTS

    j = 0

    max = i - 1

    while j < max

      distx = $r[j] - $r[i]

      disty = $r[$NPARTS + j] - $r[$NPARTS + i]

      distz = $r[$NPARTS2 + j] - $r[$NPARTS2 + i]

      $pot = $pot + 1.0 / Math.sqrt(distx * distx + disty * disty + distz * distz)

      j += 1

    end

    i += 1

  end

end

 

 

def initPositions()

  i = 0

  while i < $SIZE

    $r[i] = 0.5 + rand()

    i += 1

  end

end

 

def updatePositions()

  i = 0

  while i < $SIZE

    $r[i] -= 0.5 + rand()

    i += 1

  end

end

 

initPositions()

updatePositions()

 

start = Time.now()

i = 0

while i < $NITER

    pot = 0.0

    computePot()

    if i % 10 == 0

        print "%5d Potential %10.7f\n" % [i, $pot]

  end

    updatePositions()

  i += 1

end

stop = Time.now()

print "Seconds = %10.9f\n" % (stop - start)

Squirrel:

const NPARTS = 1000

const NITER = 201

const DIMS = 3

const NPARTS2 = 2000

const RANDMAX  = 32168.0

local SIZE = DIMS * NPARTS

local pot = 0.0

local r = []

local sqrt = sqrt

local rand = rand

 

local function computePot() {

  for(local i = 0; i < NPARTS; ++i) {

    for(local j = 0, k = i-1; j < k; ++j) {

      local distx = r[j] - r[i]

      local disty = r[NPARTS  + j] - r[NPARTS  + i]

      local distz = r[NPARTS2 + j] - r[NPARTS2 + i]

      pot += 1.0 / sqrt(distx * distx + disty * disty + distz * distz)

    }

  }

}

 

local function initPositions() {

  for(local i = 0; i < SIZE; ++i) {

    r.append(0.5 + rand() / RANDMAX)

  }

}

 

local function updatePositions() {

  for(local i = 0; i < SIZE; ++i) {

    r[i] -= 0.5 + rand() / RANDMAX

  }

}

 

initPositions()

updatePositions()

 

local start = clock()

for(local i = 0; i < NITER; ++i) {

  pot = 0.0

  computePot()

  if(i % 10 == 0) {

    print(format("%5d: Potential: %10.7f\n", i, pot))

  }

  updatePositions()

}

local stop = clock()

print(format("Seconds = %10.9f\n", stop - start))

0 0