SoC performance benchmark
Preface
This article describes the programs used to benchmark SoC (including SMP) performance, and the steps to build and run them. At the end, I provide two scripts that make the benchmarking work more efficient.
These programs measure integer and floating-point performance, as well as the latency of the L1 and L2 caches. All of them are available online, and some come from lmbench. For lmbench, see my previous blog post (in Chinese), ARM Linux BenchMark, and the GitHub repo that accompanies that post:
https://github.com/tonyho/ARM_BenchMark
In addition, to compare the SoC in a phone with the one on an ARM Linux board:
①Install the benchmark APKs on the Android phone and run them (Roy Longbottom has collected and adapted many benchmark tools for Android)
②Use the tools in the repo below to run the benchmarks on the ARM Linux board:
https://github.com/tonyho/ARM-MP-BenchMark
③Compare the results
1. Integer BenchMark: CoreMark(version:1.01)
compile:
Download CoreMark from http://www.eembc.org/
①compile the source code for single core CPU:
arm-poky-linux-gnueabi-gcc -c -march=armv7-a -mfloat-abi=hard -mfpu=neon -mtune=cortex-a15 -I./ -Isimple -DITERATIONS=0 -DSEED_METHOD=SEED_ARG -DCOMPILER_FLAGS=\""-march=armv7-a-mfloat-abi=hard-mfpu=neon-mtune=cortex-a15-Os\"" -Os core_main.c core_list_join.c core_matrix.c core_state.c core_util.c simple/core_portme.c
Link:
arm-poky-linux-gnueabi-gcc core_main.o core_list_join.o core_matrix.o core_state.o core_util.o core_portme.o -o coremark -lc
For static link:
arm-poky-linux-gnueabi-gcc core_main.o core_list_join.o core_matrix.o core_state.o core_util.o core_portme.o -o coremark.static -lc -static
②compile the source code for multicore CPU:
cp linux/ -r arm_ti
#Modify the CC and LD to cross compile toolchain gcc
gvim arm_ti/core_portme.mak
#build the coremark:
make PORT_DIR=./arm_ti/ XCFLAGS="-DMULTITHREAD=4 -DUSE_FORK=1"
make PORT_DIR=./arm_ti/ REBUILD=1
③Toolchain problem
Some toolchains, such as the one built by Yocto 1.6.1, cannot pass a string macro that contains spaces. For these toolchains, pass the flags string without spaces:
cp linux/ -r arm_ti
#Modify the CC and LD to cross compile toolchain gcc
gvim arm_ti/core_portme.mak
Build the source code; the output executable is coremark.exe:
make clean && arm-poky-linux-gnueabi-gcc -O2 -I./arm_ti/ -I. -DFLAGS_STR=\""-O2-DMULTITHREAD=2-DUSE_FORK=1-DPERFORMANCE_RUN=1-lrt"\" -DITERATIONS=0 -DMULTITHREAD=2 -DUSE_FORK=1 -DPERFORMANCE_RUN=1 core_list_join.c core_main.c core_matrix.c core_state.c core_util.c ./arm_ti//core_portme.c -o ./coremark.exe -lrt
usage:
1. Copy coremark (coremark.exe for the multicore build) to /usr/bin on the target
cp coremark/coremark.exe ...
2. run the coremark
Replace ITER_PROFILE with an iteration count large enough that CoreMark runs for at least 1 minute.
time coremark/coremark.exe 0x0 0x0 0x66 ITER_PROFILE 7 1 2000
3. get the average result
After CoreMark prints its result, rerun it several times, take the Iterations/Sec value from each run, average them, and fill in the table. E.g.:
time coremark 0x0 0x0 0x66 400000 7 1 2000
①single core result log example
2K performance run parameters for coremark.
CoreMark Size : 666
Total ticks : 250749878
Total time (secs): 250.749878
Iterations/Sec : 1595.215133
Iterations : 400000
Compiler version : GCC4.8.3 20140401 (prerelease)
Compiler flags : arm-poky-linux-gnueabi-gcc4.8.3-march=armv7-a-mfloat-abi=hard-mfpu=neon-mtune=cortex-a15
Memory location : STACK
seedcrc : 0xe9f5
[0]crclist : 0xe714
[0]crcmatrix : 0x1fd7
[0]crcstate : 0x8e3a
[0]crcfinal : 0x65c5
Correct operation validated. See readme.txt for run and reporting rules.
CoreMark 1.0 : 1595.215133 / GCC4.8.3 20140401 (prerelease) arm-poky-linux-gnueabi-gcc4.8.3-march=armv7-a-mfloat-abi=hard-mfpu=neon-mtune=cortex-a15 / STACK
real 4m10.831s
user 4m10.750s
sys 0m0.000s
②multicore/multithread result log example
2K performance run parameters for coremark.
CoreMark Size : 666
Total ticks : 58661
Total time (secs): 58.661000
Iterations/Sec : 9546.376639
Iterations : 560000
Compiler version : GCC4.8.3 20140401 (prerelease)
Compiler flags : -O2 -DMULTITHREAD=2 -DUSE_FORK=1 -DPERFORMANCE_RUN=1 -lrt
Parallel Fork : 2
Memory location : Please put data memory location here (e.g. code in flash, data on heap etc)
seedcrc : 0xe9f5
[0]crclist : 0xe714
[1]crclist : 0xe714
[0]crcmatrix : 0x1fd7
[1]crcmatrix : 0x1fd7
[0]crcstate : 0x8e3a
[1]crcstate : 0x8e3a
[0]crcfinal : 0xbd59
[1]crcfinal : 0xbd59
Correct operation validated. See readme.txt for run and reporting rules.
CoreMark 1.0 : 9546.376639 / GCC4.8.3 20140401 (prerelease) -O2 -DMULTITHREAD=2 -DUSE_FORK=1 -DPERFORMANCE_RUN=1 -lrt / Heap / 2:Fork
real 0m58.670s
user 1m57.260s
sys 0m0.000s
For more details, refer to the ARM document: CoreMark Benchmarking for ARM Cortex Processors
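The averaging in step 3 can be scripted. The following is a minimal sketch, assuming the output of every run has been appended to one log file and that each run prints a single "Iterations/Sec : value" line as in the logs above. The function name avg_iterations and the file name coremark.log are hypothetical and not part of CoreMark.

```shell
#!/bin/sh
# avg_iterations LOGFILE
# Average every "Iterations/Sec" value found in LOGFILE.
# Hypothetical helper; the log layout is assumed to match the logs above.
avg_iterations() {
    awk -F':' '
        /^Iterations\/Sec/ { sum += $2; n++ }   # $2 is the number after the colon
        END {
            if (n) printf "%.6f\n", sum / n
            else   print "no results"
        }' "$1"
}

# Usage, assuming several runs were appended to coremark.log:
# avg_iterations coremark.log
```

Lines such as "Iterations : 400000" are not matched, because the pattern requires the "/Sec" suffix.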
2. Float BenchMark
Use lat_ops from lmbench (version 3.0); it is a single-core test program
1. program position
lmbench/bin/lat_ops, copy the lmbench to target board
cp -r lmbench /
2. run
Change the working directory to lmbench/bin/arm-linux, run lat_ops several times, and take the average as the result:
for example:
root@xxx:/# cd /lmbench/bin/arm-linux/
root@xxx:/lmbench/bin/arm-linux# ./lat_ops
integer bit: 0.67 nanoseconds
integer add: 0.67 nanoseconds
integer mul: 2.08 nanoseconds
integer div: 57.43 nanoseconds
integer mod: 8.11 nanoseconds
int64 bit: 0.68 nanoseconds
uint64 add: 0.74 nanoseconds
int64 mul: 3.36 nanoseconds
int64 div: 90.15 nanoseconds
int64 mod: 62.60 nanoseconds
float add: 3.36 nanoseconds
float mul: 4.04 nanoseconds
float div: 12.14 nanoseconds
double add: 3.36 nanoseconds
double mul: 4.04 nanoseconds
double div: 21.52 nanoseconds
float bogomflops: 10.77 nanoseconds
double bogomflops: 20.20 nanoseconds
3. L1 L2 Cache Latency BenchMark
use the lat_mem_rd from lmbench(version:3.0), single core test program
1. prepare
program position: lmbench/bin/lat_mem_rd, copy the lmbench to target board
cp -r lmbench /
2. run
change the working directory to lmbench/bin/arm-linux, and run the lat_mem_rd for several times and get average value as the result value.
./lat_mem_rd 1M
In the output, the first column is the array size in MB and the second column is the latency in nanoseconds. Read the cache latencies from these rows:
0.00098-->L1 Cache
0.12500-->L2 Cache
eg:
root@xxx:/lmbench/bin/arm-linux# ./lat_mem_rd 1M
"stride=128
0.00049 2.687
0.00098 2.688
0.00195 2.688
0.00293 2.688
0.00391 2.669
0.00586 2.669
0.00781 2.669
0.01172 2.669
0.01562 2.669
0.02344 8.708
0.03125 7.198
0.04688 13.687
0.06250 13.189
0.09375 14.683
0.12500 14.683
0.18750 14.746
0.25000 14.746
0.37500 14.783
0.50000 14.933
0.75000 27.538
1.00000 70.250
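Picking the two rows out of a saved log can be done with awk. This is a small sketch, assuming the output was saved to a file in the two-column layout shown above. The function name cache_latency and the file name lat_mem_rd.log are hypothetical, and the row sizes are the ones this article reads as L1 and L2 on this particular board; other SoCs with different cache sizes may need other rows.

```shell
#!/bin/sh
# cache_latency LOGFILE
# Print the latency (ns) of the rows used in this article:
#   0.00098 MB -> L1 cache, 0.12500 MB -> L2 cache.
# Hypothetical helper; sizes are compared as strings, exactly as printed.
cache_latency() {
    awk '$1 == "0.00098" { print "L1", $2 }
         $1 == "0.12500" { print "L2", $2 }' "$1"
}

# Usage (2>&1 is kept, as in the run script's loop, in case the
# results are written to stderr):
# ./lat_mem_rd 1M 2>&1 | tee lat_mem_rd.log
# cache_latency lat_mem_rd.log
```

If the log holds several runs, each run contributes one L1 line and one L2 line, which can then be averaged.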
4. DMIPS BenchMark
Use the Dhrystone(version:2.1), single core test program
1.Get the source
get the source from: http://www.roylongbottom.org.uk/linux%20benchmarks.htm#anchor4
wget 'http://www.roylongbottom.org.uk/classic_benchmarks.tar.gz'
wget 'http://linux-sunxi.org/images/a/a1/Classic_benchmarks.patch'
tar -xzf classic_benchmarks.tar.gz
patch -p0 < Classic_benchmarks.patch
cd classic_benchmarks/source_code/
2. Setting the tuning options
change the toolchain path, and tuning options:
gvim Makefile
CC=gcc-4.7
==>
CC=XXXX-gcc
CFLAGS=-static -O3 -mcpu=cortex-A8 -mtune=cortex-A8 -mfpu=neon -funroll-loops
==>
CFLAGS=-static -O3 -mcpu=cortex-A15 -mtune=cortex-A15 -mfpu=neon -funroll-loops
3. change the SoC type string, and CPU frequency
gvim common_32bit/cpuidc.c
Change the string and SoC frequency:
strcpy(idString1, "Cortex A8");
==>
strcpy(idString1, "Cortex A15");
megaHz = 1000;
==>
megaHz = 1500;
4. build the program
make
5. run the dhry2 test program
1. Copy dhry2 to the target board, make it executable, and run it:
cp dhry2 XXXX
chmod a+x ./dhry2
./dhry2
2. The VAX MIPS rating is the DMIPS value. Rerun the program several times and take the average as the result
eg:
root@xxx:/# dhry2
####################################################
getDetails and MHz
Assembler CPUID and RDTSC
CPU Cortex A8, Features Code 00000000, Model Code 00000000
Measured - Minimum 1500 MHz, Maximum 1500 MHz
Linux Functions
get_nprocs() - CPUs 2, Configured CPUs 2
get_phys_pages() and size - RAM Size 1.97 GB, Page Size 4096 Bytes
uname() - Linux, saturn15, 3.10.31-ltsi
#1 SMP PREEMPT Tue Dec 9 13:39:16 JST 2014, armv7l
##########################################
Dhrystone Benchmark, Version 2.1 (Language: C or C++)
Optimisation Opt 3 64 Bit
Register option not selected
40000 runs 0.00 seconds
400000 runs 0.05 seconds
4000000 runs 0.49 seconds
8000000 runs 0.97 seconds
16000000 runs 1.94 seconds
32000000 runs 3.89 seconds
Final values (* implementation-dependent):
Int_Glob: O.K. 5 Bool_Glob: O.K. 1
Ch_1_Glob: O.K. A Ch_2_Glob: O.K. B
Arr_1_Glob[8]: O.K. 7 Arr_2_Glob8/7: O.K. 32000010
Ptr_Glob-> Ptr_Comp: * 610704
Discr: O.K. 0 Enum_Comp: O.K. 2
Int_Comp: O.K. 17 Str_Comp: O.K. DHRYSTONE PROGRAM, SOME STRING
Next_Ptr_Glob-> Ptr_Comp: * 610704 same as above
Discr: O.K. 0 Enum_Comp: O.K. 1
Int_Comp: O.K. 18 Str_Comp: O.K. DHRYSTONE PROGRAM, SOME STRING
Int_1_Loc: O.K. 5 Int_2_Loc: O.K. 13
Int_3_Loc: O.K. 7 Enum_Loc: O.K. 1
Str_1_Loc: O.K. DHRYSTONE PROGRAM, 1'ST STRING
Str_2_Loc: O.K. DHRYSTONE PROGRAM, 2'ND STRING
Microseconds for one run through Dhrystone: 0.12
Dhrystones per Second: 8232458
VAX MIPS rating = 4685.52
Press Enter
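To compare cores that run at different clock speeds, the DMIPS value is often normalized to DMIPS/MHz. For the run above, that is 4685.52 / 1500 ≈ 3.12. A minimal sketch of the calculation, assuming the dhry2 output was saved to a log file; the function name dmips_per_mhz and the file name dhry2.log are hypothetical, and the clock frequency must be supplied by hand:

```shell
#!/bin/sh
# dmips_per_mhz LOGFILE MHZ
# Take the last "VAX MIPS rating" found in LOGFILE and divide it by MHZ.
# Hypothetical helper; the log layout is assumed to match the output above.
dmips_per_mhz() {
    awk -v mhz="$2" '
        /VAX +MIPS +rating/ {
            split($0, part, "=")      # part[2] holds the number after "="
            rating = part[2] + 0
            found = 1
        }
        END { if (found) printf "%.2f\n", rating / mhz }' "$1"
}

# Usage, for the 1500 MHz board in this article:
# echo | ./dhry2 > dhry2.log
# dmips_per_mhz dhry2.log 1500
```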
5. Scripts
For a benchmark, we usually run each test several times and average the results to get the final value. I have written two scripts to do this.
The two scripts are in my Bitbucket snippet CPU_BenchMark_Scripts:
- CPUBenchMark_Average.sh: runs on the host, or on a target board that has bash, awk, and grep
- CPU_RunBenchMark.sh: runs on the target
CPU_RunBenchMark.sh runs the benchmark programs and stores the results in PROGRAM_NAME.log, where PROGRAM_NAME is the name of the program, e.g. coremark.
CPUBenchMark_Average.sh averages the results stored in the PROGRAM_NAME.log files.
The steps to use the scripts are:
①Copy the benchmark programs(coremark.exe dhry2 lat_ops lat_mem_rd) to target board
②Copy the CPU_RunBenchMark.sh and CPUBenchMark_Average.sh to the same directory as benchmark programs
③Modify the CPU_RunBenchMark.sh to suit the directory
runTest coremark_v1.0 'time ./coremark.exe 0x0 0x0 0x66 200000 7 1 2000' coremark.log
runTest classic_benchmarks/source_code 'echo | ./dhry2' dhry2.log 10
runTest lmbench/bin/arm-linux './lat_ops' lat_ops.log
runTest lmbench/bin/arm-linux './lat_mem_rd 1M' lat_mem_rd.log
The runTest shell function runs a program ($2) located in the directory $1 and appends its output to the log file $3.
④Modify the for loop to set how many times each benchmark program runs.
for i in 1 2 3 4 5 6 7 8 9 10; do
    eval "$2" 2>&1 | tee -a $3
done
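For readers who do not have the snippet at hand, the following is a rough sketch of what a runTest function built around that loop could look like. It is not the author's implementation. Treating a fourth argument as the run count, and keeping the logs in the directory the script was started from, are both assumptions.

```shell
#!/bin/sh
# runTest DIR COMMAND LOGFILE [RUNS]
# Sketch only; the real function lives in CPU_RunBenchMark.sh.
# Assumptions: RUNS defaults to 10, and a relative LOGFILE is resolved
# against the directory runTest was called from.
runTest() {
    dir=$1
    cmd=$2
    case "$3" in
        /*) log=$3 ;;            # absolute path: use as given
        *)  log=$PWD/$3 ;;       # relative path: keep it beside the script
    esac
    runs=${4:-10}
    (
        cd "$dir" || exit 1      # subshell, so the caller's directory is unchanged
        i=0
        while [ "$i" -lt "$runs" ]; do
            eval "$cmd" 2>&1 | tee -a "$log"
            i=$((i + 1))
        done
    )
}
```

A counter loop is used here instead of a fixed "1 2 3 ..." list so that the run count can be changed per benchmark without editing the function.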
⑤Average the results
Run CPUBenchMark_Average.sh on the target board if it ships with grep and awk. If the board does not have these tools, copy the logs and the script to a host PC and run it there. It prints the results to STDOUT, e.g.:
$ sh average.sh
===========CoreMark================================
Iterations/Sec = 9569.107810
===========Dhry2===================================
VAX MIPS rating = 4685.468000
===========L1 Lat==================================
0.00098 = 2.669300
===========L2 Lat==================================
0.12500 = 14.684400
===========integer=================================
integer bit = 0.670000
integer add = 0.670000
integer mul = 2.070000
integer div = 56.908000
integer mod = 8.044000
===========int64==================================
int64 bit = 0.670000
uint64 add = 0.710000
int64 mul = 3.340000
int64 div = 89.491000
int64 mod = 62.155000
===========float==================================
float add = 3.340000
float mul = 4.009000
float div = 12.022000
===========double=================================
double add = 3.340000
double mul = 4.010000
double div = 21.372000
===========float/double bogo======================
float bogomflops = 10.688000
double bogomflops = 20.038000
If this article has formatting problems, please see the original: http://www.hexiongjun.com/?p=174
Please credit the source when reposting. Author: TonyHo, hexiongjun.com