SoC performance benchmark


Preface

This article describes the programs used to benchmark SoC performance (including SMP systems) and the steps to build and run them. At the end, I provide two scripts that make the benchmarking work more efficient.

These benchmark programs evaluate integer and floating-point performance, as well as the latency of the L1 and L2 caches. The tools can be downloaded from the Internet, and some of them come from lmbench. For lmbench, see my previous blog post (in Chinese), "ARM Linux BenchMark", and the GitHub repo that accompanies that post:

https://github.com/tonyho/ARM_BenchMark

Besides, if you want to compare the SoC in a phone with the one on an ARM Linux board, you can do the following:

①Install the benchmark APKs on the Android phone and run them (Roy Longbottom has collected and modified many benchmark tools for Android)

②Then use the tools in the repo below to run the benchmarks on the ARM Linux board:

https://github.com/tonyho/ARM-MP-BenchMark

③Compare the results

1. Integer BenchMark: CoreMark(version:1.01)

Compile:

Download CoreMark from http://www.eembc.org/

①Compile the source code for a single-core CPU:
arm-poky-linux-gnueabi-gcc -c -march=armv7-a -mfloat-abi=hard -mfpu=neon -mtune=cortex-a15 -I./ -Isimple -DITERATIONS=0 -DSEED_METHOD=SEED_ARG -DCOMPILER_FLAGS=\""-march=armv7-a-mfloat-abi=hard-mfpu=neon-mtune=cortex-a15-Os\"" -Os core_main.c core_list_join.c core_matrix.c core_state.c core_util.c simple/core_portme.c

Link:

arm-poky-linux-gnueabi-gcc core_main.o core_list_join.o core_matrix.o core_state.o core_util.o core_portme.o -o coremark -lc

For static link:

arm-poky-linux-gnueabi-gcc core_main.o core_list_join.o core_matrix.o core_state.o core_util.o core_portme.o -o coremark.static -lc -static
②Compile the source code for a multicore CPU:
cp linux/ -r arm_ti

#Set CC and LD to the cross-compile toolchain gcc

gvim arm_ti/core_portme.mak

#Build CoreMark:

make PORT_DIR=./arm_ti/ XCFLAGS="-DMULTITHREAD=4 -DUSE_FORK=1"
make PORT_DIR=./arm_ti/ REBUILD=1

③Toolchain problem
Some toolchains cannot pass a string macro that contains spaces, for example the toolchain built by Yocto 1.6.1. For these toolchains, build as follows:

cp linux/ -r arm_ti

#Set CC and LD to the cross-compile toolchain gcc

gvim arm_ti/core_portme.mak

Build the source code; the output executable is coremark.exe:

make clean && arm-poky-linux-gnueabi-gcc -O2 -I./arm_ti/ -I. -DFLAGS_STR=\""-O2-DMULTITHREAD=2-DUSE_FORK=1-DPERFORMANCE_RUN=1-lrt"\" -DITERATIONS=0 -DMULTITHREAD=2 -DUSE_FORK=1 -DPERFORMANCE_RUN=1 core_list_join.c core_main.c core_matrix.c core_state.c core_util.c ./arm_ti//core_portme.c -o ./coremark.exe -lrt

Usage:

1. Copy coremark (coremark.exe for the multicore build) to /usr/bin
cp coremark/coremark.exe ...
2. Run CoreMark

Replace ITER_PROFILE with an iteration count large enough that CoreMark runs for at least 1 minute.

time coremark/coremark.exe 0x0 0x0 0x66 ITER_PROFILE 7 1 2000
3. Get the average result

After CoreMark prints its result, rerun it several times, take the Iterations/Sec value from each run, and average them to fill in your results table. For example:

time coremark 0x0 0x0 0x66 400000 7 1 2000
①Single-core result log example
2K performance run parameters for coremark.
CoreMark Size : 666
Total ticks : 250749878
Total time (secs): 250.749878
Iterations/Sec : 1595.215133
Iterations : 400000
Compiler version : GCC4.8.3 20140401 (prerelease)
Compiler flags : arm-poky-linux-gnueabi-gcc4.8.3-march=armv7-a-mfloat-abi=hard-mfpu=neon-mtune=cortex-a15
Memory location : STACK
seedcrc : 0xe9f5
[0]crclist : 0xe714
[0]crcmatrix : 0x1fd7
[0]crcstate : 0x8e3a
[0]crcfinal : 0x65c5
Correct operation validated. See readme.txt for run and reporting rules.
CoreMark 1.0 : 1595.215133 / GCC4.8.3 20140401 (prerelease) arm-poky-linux-gnueabi-gcc4.8.3-march=armv7-a-mfloat-abi=hard-mfpu=neon-mtune=cortex-a15 / STACK
real 4m10.831s
user 4m10.750s
sys 0m0.000s
②Multicore/multithread result log example

2K performance run parameters for coremark.
CoreMark Size : 666
Total ticks : 58661
Total time (secs): 58.661000
Iterations/Sec : 9546.376639
Iterations : 560000
Compiler version : GCC4.8.3 20140401 (prerelease)
Compiler flags : -O2 -DMULTITHREAD=2 -DUSE_FORK=1 -DPERFORMANCE_RUN=1 -lrt
Parallel Fork : 2
Memory location : Please put data memory location here (e.g. code in flash, data on heap etc)
seedcrc : 0xe9f5
[0]crclist : 0xe714
[1]crclist : 0xe714
[0]crcmatrix : 0x1fd7
[1]crcmatrix : 0x1fd7
[0]crcstate : 0x8e3a
[1]crcstate : 0x8e3a
[0]crcfinal : 0xbd59
[1]crcfinal : 0xbd59
Correct operation validated. See readme.txt for run and reporting rules.
CoreMark 1.0 : 9546.376639 / GCC4.8.3 20140401 (prerelease) -O2 -DMULTITHREAD=2 -DUSE_FORK=1 -DPERFORMANCE_RUN=1 -lrt / Heap / 2:Fork
real 0m58.670s
user 1m57.260s
sys 0m0.000s

For more details, refer to the ARM document "CoreMark Benchmarking for ARM Cortex Processors".
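Step 3 above (collect Iterations/Sec from several runs and average them) can be sketched in a few lines of shell. This is a minimal sketch, not the author's script: the function name avg_iterations_per_sec is hypothetical, and it assumes the log contains one "Iterations/Sec : <value>" line per run, as in the result logs shown above.

```shell
#!/bin/sh
# Average every "Iterations/Sec : <value>" line found in the given log
# files (or on stdin when no file is given).
# Assumption: one such line per CoreMark run; the function name is hypothetical.
avg_iterations_per_sec() {
    awk -F: '
        /^Iterations\/Sec/ { sum += $2; n++ }   # "Iterations :" lines do not match
        END {
            if (n == 0) exit 1                   # no result lines found
            printf "%.6f\n", sum / n
        }' "$@"
}

# Demo with two made-up runs.
avg_iterations_per_sec <<'EOF'
Iterations/Sec : 1000.000000
Iterations : 400000
Iterations/Sec : 2000.000000
EOF
# prints 1500.000000
```

On a real log the call would be, for example, avg_iterations_per_sec coremark.log.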

2. Float BenchMark

Use lat_ops from lmbench (version 3.0), a single-core test program.

1. Program location

The program is at lmbench/bin/lat_ops. Copy the lmbench directory to the target board:

cp -r lmbench /

2. Run

Change the working directory to lmbench/bin/arm-linux, run lat_ops several times, and take the average of each value as the result.
For example:

root@xxx:/# cd /lmbench/bin/arm-linux/
root@xxx:/lmbench/bin/arm-linux# ./lat_ops
integer bit: 0.67 nanoseconds
integer add: 0.67 nanoseconds
integer mul: 2.08 nanoseconds
integer div: 57.43 nanoseconds
integer mod: 8.11 nanoseconds
int64 bit: 0.68 nanoseconds
uint64 add: 0.74 nanoseconds
int64 mul: 3.36 nanoseconds
int64 div: 90.15 nanoseconds
int64 mod: 62.60 nanoseconds
float add: 3.36 nanoseconds
float mul: 4.04 nanoseconds
float div: 12.14 nanoseconds
double add: 3.36 nanoseconds
double mul: 4.04 nanoseconds
double div: 21.52 nanoseconds
float bogomflops: 10.77 nanoseconds
double bogomflops: 20.20 nanoseconds
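Because lat_ops prints one value per operation, averaging several runs means averaging per operation name. The following is a minimal sketch of that idea, assuming each result line has the form "<name>: <value> nanoseconds" as in the output above; the function name avg_lat_ops is hypothetical and this is not the author's script.

```shell
#!/bin/sh
# Average each "<name>: <value> nanoseconds" line across all runs found
# in the given log files (or stdin). Names keep their first-seen order.
# The function name is hypothetical.
avg_lat_ops() {
    awk -F': ' '
        / nanoseconds/ {
            if (!($1 in n)) order[++k] = $1   # remember first appearance
            sum[$1] += $2 + 0                 # "$2 + 0" keeps the leading number
            n[$1]++
        }
        END {
            for (i = 1; i <= k; i++)
                printf "%s = %.6f\n", order[i], sum[order[i]] / n[order[i]]
        }' "$@"
}

# Demo with two made-up runs of two operations.
avg_lat_ops <<'EOF'
integer bit: 0.50 nanoseconds
integer div: 10.00 nanoseconds
integer bit: 1.50 nanoseconds
integer div: 20.00 nanoseconds
EOF
```

Shell prompt lines and other text in the log are skipped because they do not contain " nanoseconds".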

3. L1 L2 Cache Latency BenchMark

Use lat_mem_rd from lmbench (version 3.0), a single-core test program.

1. Prepare

Program location: lmbench/bin/lat_mem_rd. Copy the lmbench directory to the target board:

cp -r lmbench /

2. Run

Change the working directory to lmbench/bin/arm-linux, run lat_mem_rd several times, and take the average value as the result.

./lat_mem_rd 1M

In the program output, the first column is the array size in MB and the second column is the load latency in nanoseconds. The rows at the following sizes are taken as the cache latencies:
0.00098-->L1 Cache
0.12500-->L2 Cache
For example:

root@xxx:/lmbench/bin/arm-linux# ./lat_mem_rd 1M
"stride=128
0.00049 2.687
0.00098 2.688
0.00195 2.688
0.00293 2.688
0.00391 2.669
0.00586 2.669
0.00781 2.669
0.01172 2.669
0.01562 2.669
0.02344 8.708
0.03125 7.198
0.04688 13.687
0.06250 13.189
0.09375 14.683
0.12500 14.683
0.18750 14.746
0.25000 14.746
0.37500 14.783
0.50000 14.933
0.75000 27.538
1.00000 70.250
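Picking the L1 and L2 rows out of this output by hand is error-prone, so a small lookup helper may be useful. This is a sketch under the assumption that each data line is "<size in MB> <latency in ns>" as shown above; latency_at is a hypothetical name, and it reports only the first matching row (one run).

```shell
#!/bin/sh
# Print the latency reported for one array size in lat_mem_rd output.
# Usage: latency_at SIZE [FILE...]   (reads stdin when no file is given)
# Returns non-zero when the size does not appear. The name is hypothetical.
latency_at() {
    _size=$1; shift
    awk -v size="$_size" '
        NF == 2 && $1 == size { print $2; found = 1; exit }
        END { if (!found) exit 1 }' "$@"
}

# Demo using a few rows copied from the log above.
latency_at 0.12500 <<'EOF'
"stride=128
0.00049 2.687
0.00098 2.688
0.12500 14.683
EOF
# prints 14.683
```

For the sizes used in this article, latency_at 0.00098 lat_mem_rd.log would give the L1 value and latency_at 0.12500 lat_mem_rd.log the L2 value of the first run in the log.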

4. DMIPS BenchMark

Use Dhrystone (version 2.1), a single-core test program.

1. Get the source

Get the source from: http://www.roylongbottom.org.uk/linux%20benchmarks.htm#anchor4

wget 'http://www.roylongbottom.org.uk/classic_benchmarks.tar.gz'
wget 'http://linux-sunxi.org/images/a/a1/Classic_benchmarks.patch'
tar -xzf classic_benchmarks.tar.gz
patch -p0 < Classic_benchmarks.patch
cd classic_benchmarks/source_code/


2. Set the tuning options

Change the toolchain path and the tuning options:

gvim Makefile
CC=gcc-4.7 ==> CC=XXXX-gcc
CFLAGS=-static -O3 -mcpu=cortex-A8 -mtune=cortex-A8 -mfpu=neon -funroll-loops ==> CFLAGS=-static -O3 -mcpu=cortex-A15 -mtune=cortex-A15 -mfpu=neon -funroll-loops

3. Change the SoC type string and the CPU frequency

gvim common_32bit/cpuidc.c

Change the ID string and the SoC frequency:

strcpy(idString1, "Cortex A8"); ==> strcpy(idString1, "Cortex A15");
megaHz = 1000; ==> megaHz = 1500;

4. Build the program

make

5. Run the dhry2 test program

1. Copy dhry2 to the target board, make the file executable, and run it:

cp dhry2 XXXX
chmod a+x ./dhry2
./dhry2

2. The VAX MIPS rating is the DMIPS value. Rerun the program several times and take the average as the result.
For example:

root@xxx:/# dhry2
####################################################
getDetails and MHz
Assembler CPUID and RDTSC
CPU Cortex A8, Features Code 00000000, Model Code 00000000
Measured - Minimum 1500 MHz, Maximum 1500 MHz
Linux Functions
get_nprocs() - CPUs 2, Configured CPUs 2
get_phys_pages() and size - RAM Size 1.97 GB, Page Size 4096 Bytes
uname() - Linux, saturn15, 3.10.31-ltsi
#1 SMP PREEMPT Tue Dec 9 13:39:16 JST 2014, armv7l
##########################################
Dhrystone Benchmark, Version 2.1 (Language: C or C++)
Optimisation Opt 3 64 Bit
Register option not selected
40000 runs 0.00 seconds
400000 runs 0.05 seconds
4000000 runs 0.49 seconds
8000000 runs 0.97 seconds
16000000 runs 1.94 seconds
32000000 runs 3.89 seconds
Final values (* implementation-dependent):
Int_Glob: O.K. 5 Bool_Glob: O.K. 1
Ch_1_Glob: O.K. A Ch_2_Glob: O.K. B
Arr_1_Glob[8]: O.K. 7 Arr_2_Glob8/7: O.K. 32000010
Ptr_Glob-> Ptr_Comp: * 610704
Discr: O.K. 0 Enum_Comp: O.K. 2
Int_Comp: O.K. 17 Str_Comp: O.K. DHRYSTONE PROGRAM, SOME STRING
Next_Ptr_Glob-> Ptr_Comp: * 610704 same as above
Discr: O.K. 0 Enum_Comp: O.K. 1
Int_Comp: O.K. 18 Str_Comp: O.K. DHRYSTONE PROGRAM, SOME STRING
Int_1_Loc: O.K. 5 Int_2_Loc: O.K. 13
Int_3_Loc: O.K. 7 Enum_Loc: O.K. 1
Str_1_Loc: O.K. DHRYSTONE PROGRAM, 1'ST STRING
Str_2_Loc: O.K. DHRYSTONE PROGRAM, 2'ND STRING
Microseconds for one run through Dhrystone: 0.12
Dhrystones per Second: 8232458
VAX MIPS rating = 4685.52
Press Enter
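The VAX MIPS rating in the log is the Dhrystones-per-second figure divided by 1757, the Dhrystone score of the VAX 11/780 reference machine. The short sketch below reproduces that arithmetic and also derives DMIPS/MHz, a figure this article does not report but which is commonly quoted. The function names dmips and dmips_per_mhz are hypothetical.

```shell
#!/bin/sh
# DMIPS = Dhrystones per second / 1757 (VAX 11/780 reference score).
# Both function names are hypothetical.
dmips() {
    awk -v d="$1" 'BEGIN { printf "%.2f\n", d / 1757 }'
}

# DMIPS per MHz: the same value normalised by the CPU clock in MHz.
dmips_per_mhz() {
    awk -v d="$1" -v mhz="$2" 'BEGIN { printf "%.2f\n", d / 1757 / mhz }'
}

dmips 8232458               # prints 4685.52, matching the log above
dmips_per_mhz 8232458 1500  # prints 3.12 for the 1500 MHz clock in the log
```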

5. Scripts

When benchmarking, we usually run each test several times and then average the results to get a final value. I have written two scripts to do this.

The two scripts are in my Bitbucket snippet CPU_BenchMark_Scripts:

  • CPUBenchMark_Average.sh: runs on the host, or on a target board that has bash, awk, and grep
  • CPU_RunBenchMark.sh: runs on the target


CPU_RunBenchMark.sh runs the benchmark programs and stores the results in PROGRAM_NAME.log, where PROGRAM_NAME is the name of the program, e.g. coremark.

CPUBenchMark_Average.sh averages the results stored in PROGRAM_NAME.log.

The steps to use the scripts are as follows:

①Copy the benchmark programs (coremark.exe, dhry2, lat_ops, lat_mem_rd) to the target board

②Copy CPU_RunBenchMark.sh and CPUBenchMark_Average.sh to the same directory as the benchmark programs

③Modify CPU_RunBenchMark.sh to match your directory layout

runTest coremark_v1.0 'time ./coremark.exe 0x0 0x0 0x66 200000 7 1 2000' coremark.log
runTest classic_benchmarks/source_code 'echo | ./dhry2' dhry2.log 10
runTest lmbench/bin/arm-linux './lat_ops' lat_ops.log
runTest lmbench/bin/arm-linux './lat_mem_rd 1M' lat_mem_rd.log

The runTest shell function runs a command ($2) in the directory $1 and appends its output to the log file $3.

④Modify the for loop to set how many times each benchmark program runs.

for i in 1 2 3 4 5 6 7 8 9 10; do
    eval "$2" 2>&1 | tee -a $3
done

⑤Average the results

If the target board ships with grep and awk, just run CPUBenchMark_Average.sh there. If it does not, copy the logs and the script to a host PC and run it there. The script writes the results to STDOUT, for example:

$ sh average.sh
===========CoreMark================================
Iterations/Sec = 9569.107810
===========Dhry2===================================
VAX MIPS rating = 4685.468000
===========L1 Lat==================================
0.00098 = 2.669300
===========L2 Lat==================================
0.12500 = 14.684400
===========integer=================================
integer bit = 0.670000
integer add = 0.670000
integer mul = 2.070000
integer div = 56.908000
integer mod = 8.044000
===========int64==================================
int64 bit = 0.670000
uint64 add = 0.710000
int64 mul = 3.340000
int64 div = 89.491000
int64 mod = 62.155000
===========float==================================
float add = 3.340000
float mul = 4.009000
float div = 12.022000
===========double=================================
double add = 3.340000
double mul = 4.010000
double div = 21.372000
===========float/double bogo======================
float bogomflops = 10.688000
double bogomflops = 20.038000
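For readers without access to the snippet, the behaviour described in steps ③ and ④ (run a command repeatedly inside a directory and append everything it prints to a log) might be sketched as follows. This is not the author's runTest: the name run_test, the RUNS environment variable, and its default of 10 are assumptions made for illustration.

```shell
#!/bin/sh
# run_test DIR CMD LOG: run CMD inside DIR several times, appending both
# stdout and stderr to LOG (a path relative to DIR unless absolute).
# Hypothetical sketch: the repeat count comes from the RUNS variable
# (default 10), which is an assumption and not part of the original script.
run_test() {
    dir=$1 cmd=$2 log=$3 runs=${RUNS:-10}
    (
        cd "$dir" || exit 1          # subshell keeps the caller's directory
        i=0
        while [ "$i" -lt "$runs" ]; do
            eval "$cmd" 2>&1 | tee -a "$log"
            i=$((i + 1))
        done
    )
}
```

A call such as RUNS=5 run_test lmbench/bin/arm-linux './lat_ops' lat_ops.log would then leave five runs in lmbench/bin/arm-linux/lat_ops.log. Redirecting stderr matters here because some benchmark programs print their results on stderr rather than stdout.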

If this article has formatting problems, please see: http://www.hexiongjun.com/?p=174

Please credit the source when reposting. Author: TonyHo, hexiongjun.com

