Performance Measurement on ARM

After working mostly with different ARM processors in the 200...400 MHz range in lots of Embedded Linux projects over the last years, we have seen an interesting development in the market recently:

  • ARM cpus, having been known for their low power consumption, are becoming faster and faster (example: OMAP3, Beagleboard, MX51/MX53).
  • x86, having been known for its high computing performance, is becoming more and more SoC-like, power friendly and slower.

If you read the marketing material from the chip manufacturers, it sounds as if ARM is the next x86 (in terms of performance) and x86 is the next ARM (in terms of power consumption). But where do we stand today? How fast are modern ARM derivatives?

The Pengutronix Kernel team wanted to know, so we measured, in order to get some real numbers. Here are the results, and they raise some interesting questions. Don't take the "observations" below too scientifically - I try to sum up the results in short claims.

As ARM is explicitly a low power architecture, it would have been interesting to measure some "performance vs. power consumption" data. However, as we did our experiments on board-level products, this couldn't be done: some manufacturers put more peripheral chips on their modules than others, so we would only have measured the effects of the board BoMs.

Test Hardware

In order to find out more about the real speed of today's hardware, we collected some typical industrial hardware in our lab. This is the list of devices we have benchmarked:

Test Hardware   | CPU                          | Freq.    | Core              | RAM   | Kernel
phyCORE-PXA270  | PXA270 (Marvell)             | 520 MHz  | XScale (ARMv5)    | SDRAM | 2.6.34
phyCORE-i.MX27  | MX27 (Freescale)             | 400 MHz  | ARM926 (ARMv5)    | DDR   | 2.6.34
phyCORE-i.MX35  | MX35 (Freescale)             | 532 MHz  | ARM1136 (ARMv6)   | DDR2  | 2.6.34
O3530-PB-1452   | OMAP3530 (Texas Instruments) | 500 MHz  | Cortex-A8 (ARMv7) | DDR   | 2.6.34
Beagleboard C3  | OMAP3530 (Texas Instruments) | 500 MHz  | Cortex-A8 (ARMv7) | DDR   | 2.6.34
phyCORE-Atom    | Z510 (Intel)                 | 1100 MHz | Atom              | DDR2  | 2.6.34

How fast are these boards? Yours truly assumed that the order in the table above more or less reflects the systems in ascending performance order: the PXA270 is a platform from the past, the MX27 reflects the current generation of busmatrix-optimized ARM9s, the ARM11 should be the next step, the Cortex-A8 appears to be the next killer platform, and the Atom would probably be an order of magnitude above that.

So let's look at what we've measured.

Benchmarks

Explanatory note: Each figure below is the arithmetic mean of ten benchmark cycles; in the original charts, the error bars (sometimes barely visible) indicate the range between the minimum and maximum values of those ten cycles.
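
As an illustration, one run of such a measurement could be aggregated roughly as follows (a sketch, not the exact harness used for the charts; the lat_ops invocation and the grep/cut filter are the same ones listed under "LMbench command lines" below):

root@target:~ for run in 1 2 3 4 5 6 7 8 9 10; do \
  lat_ops 2>&1 | grep "^float mul:" | cut -f3 -d" "; \
  done | awk '
    { sum += $1
      if (NR == 1 || $1 < min) min = $1
      if (NR == 1 || $1 > max) max = $1 }
    END { printf "min=%s mean=%s max=%s\n", min, sum / NR, max }'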

Floating Point Multiplication (lat_ops)

http://lmbench.sourceforge.net/cgi-bin/man?keyword=lat_ops&section=8

This benchmark measures the time for a floating point multiplication. It gives an indication of the computational power and is heavily influenced by whether a SoC has a hardware floating point unit. Here are the results:


CPU         | Time [ns]
PXA270      | 50.90
MX27        | 72.39
MX35        | 15.08
OMAP-EVM    | 20.13
OMAP-Beagle | 20.11
Atom Z510   |  4.57

The PXA270 and i.MX27 both have no hardware floating point unit, so the difference between their results seems to directly reflect the different CPU clock speeds.

An interesting observation is that the MX35 (ARM1136, 532 MHz) is faster than the OMAPs (Cortex-A8, 500 MHz). The clock frequency differs by only 6%, whereas the MX35 is about 25% faster.

Observation 1: Even when scaled to the same frequency, the ARM11 is faster than the Cortex-A8!

Observation 2: The Atom runs at about twice the clock frequency of the MX35, but needs only 4.57 ns, about one third of the MX35's 15 ns.
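
To make the frequency argument explicit, the times above can be converted into rough cycles per multiplication (time [ns] x frequency [GHz]). This is only a back-of-the-envelope calculation based on the table above, not a separate measurement:

awk 'BEGIN {
  # cycles per float mul = measured time [ns] * clock frequency [GHz]
  print "MX35 (532 MHz):      " 15.08 * 0.532 " cycles"
  print "OMAP (500 MHz):      " 20.13 * 0.500 " cycles"
  print "Atom Z510 (1.1 GHz): " 4.57 * 1.100 " cycles"
}'

Scaled this way, the ARM1136 needs roughly 8 cycles per multiplication, the Cortex-A8 about 10 and the Atom about 5, which is consistent with Observations 1 and 2.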

Memory Bandwidth (bw_mem)

http://lmbench.sourceforge.net/cgi-bin/man?keyword=bw_mem&section=8

We measure the memory transfer speed with the bw_mem benchmark.

CPU         | Bandwidth [MB/s]
PXA270      |  54.40
MX27        | 101.30
MX35        | 128.02
OMAP-EVM    | 254.05
OMAP-Beagle | 241.13
Atom Z510   | 601.39

Observation 3: There is a factor of 2 between the PXA270 and MX27/MX35.

Observation 4: The OMAP is twice as fast as the i.MX ARM9/ARM11 parts.

Observation 5: The Atom is still 2.4 times faster than the OMAP, at 2.2 times the clock rate.

Context Switching Time (lat_ctx)

http://lmbench.sourceforge.net/cgi-bin/man?keyword=lat_ctx&section=8

An important indicator of system speed is the time to switch the CPU context. This benchmark measures the context switching time; the number of processes and their working set size can be configured. The processes are started, read a token from a pipe, perform a certain amount of work and pass the token on to the next process.

CPU         | Context Switch Time [µs]
PXA270      | 462.19
MX27        | 130.40
MX35        |  38.85
OMAP-EVM    |  71.51
OMAP-Beagle |  32.29
Atom Z510   |  12.47

(16 processes, 8 KiB each)

Observation 6: This shows impressively how slow the PXA is: a factor of about 40 compared to the Atom, and still a factor of 3 compared to the ARM926.

Observation 7: The MX35/ARM1136 has almost the same speed as the Cortex-A8. I would have expected the newer Cortex to be much faster, somewhere between the ARM11 and the Atom. But the Cortex is still about three times slower than the Atom, even though it runs at only half the clock rate.

Syscall Performance (lat_syscall)

http://lmbench.sourceforge.net/cgi-bin/man?keyword=lat_syscall&section=8

In order to estimate the performance of calling operating system functionality, we measured the syscall latency with lat_syscall. The benchmark performs an open() and close() on a 1 MB random data file located in a ramdisk (tmpfs), accessing the file with a relative path (absolute paths seem to give different results). The combined time for both operations is measured.
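
The relative vs. absolute path effect can be reproduced with an invocation like the following (a sketch; only the relative-path variant was used for the published figures):

root@target:~ cd /tmp; \
  dd if=/dev/urandom of=test.pattern bs=1024 count=1024 2>/dev/null; \
  lat_syscall open test.pattern; \
  lat_syscall open /tmp/test.pattern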

CPU         | Syscall Time [µs]
PXA270      | 10.79
MX27        | 14.16
MX35        |  8.66
OMAP-EVM    | 13.67
OMAP-Beagle | 10.46
Atom Z510   |  5.84

Observation 8: The PXA isn't too bad when it comes to syscalls.

Observation 9: The Cortex-A8 and the ARM11 are almost identically fast.

Observation 10: Even between OMAP/ARM11 and the Atom, there is only a factor of 1.8.

Process Forking (lat_proc)

http://lmbench.sourceforge.net/cgi-bin/man?keyword=lat_proc&section=8

The lat_proc benchmark forks processes and measures the time to do so.

CPU         | Fork Time [µs]
PXA270      | 5426.4
MX27        | 3153.6
MX35        | 1365.6
OMAP-EVM    | 3052.63
OMAP-Beagle | 1687.63
Atom Z510   |  390.66

Observation 11: The ARM11 is even better than the Cortex-A8! I would have expected the newer Cortex to perform better here.

Observation 12: The Atom is 3 times as fast, at 2 times the clock frequency.

Specifications

Kernel and -configs

We tried to uniformly use kernel 2.6.34 on all targets. Apart from platform-specific optimization, care has been taken to always set the following config options (a quick way to verify them on a running target is sketched after the list):

  • Tree-based hierarchical RCU
  • Preemptible Kernel (Low-Latency Desktop)
  • Choose SLAB allocator (SLAB)
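
These correspond to the config symbols CONFIG_TREE_RCU, CONFIG_PREEMPT and CONFIG_SLAB, which can be checked via the exported kernel config (a sketch; it assumes the kernel was built with CONFIG_IKCONFIG_PROC, otherwise the build-time .config has to be inspected instead):

root@target:~ zcat /proc/config.gz | grep -E "CONFIG_TREE_RCU=|CONFIG_PREEMPT=|CONFIG_SLAB="
# Expected on all targets:
# CONFIG_TREE_RCU=y
# CONFIG_PREEMPT=y
# CONFIG_SLAB=y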

THUMB mode has never been used. Turning off NEON on the OMAP did not produce significantly different results. Using v5TE instead of v4T repeatedly shows worse results (not only on the OMAP), which we still do not quite understand. Anyway, the figures published here for the OMAP have been obtained using a Cortex-A8 toolchain.
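
For reference, one way to reproduce such a comparison for a single benchmark is sketched below; the flag combinations are illustrative and not necessarily the exact PTXdist settings, and the lmbench include paths are assumed to be those from the compiler command line shown further below:

# Sketch: build lat_ops.o for ARMv4T vs. ARMv5TE code generation (soft float, PXA270/i.MX27 class)
arm-v5te-linux-gnueabi-gcc -O2 -I. -I../include -march=armv4t  -c lat_ops.c -o lat_ops-v4t.o
arm-v5te-linux-gnueabi-gcc -O2 -I. -I../include -march=armv5te -c lat_ops.c -o lat_ops-v5te.o

# Sketch: Cortex-A8 with and without NEON (OMAP3530 class)
arm-cortexa8-linux-gnueabi-gcc -O2 -I. -I../include -mcpu=cortex-a8 -mfpu=neon  -mfloat-abi=softfp -c lat_ops.c -o lat_ops-neon.o
arm-cortexa8-linux-gnueabi-gcc -O2 -I. -I../include -mcpu=cortex-a8 -mfpu=vfpv3 -mfloat-abi=softfp -c lat_ops.c -o lat_ops-vfp.o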

LMbench command lines

lat_ops

root@target:~ lat_ops 2>&1

filtered by

grep "^float mul:" | cut -f3 -d" "


bw_mem

root@target:~ list="rd wr rdwr cp fwr frd bzero bcopy"; \  for i in $list; \  do echo -en "$i\t";  done; \  echo; \  for i in $list; \  do res=$(bw_mem 33554432 $i 2>&1 | awk "{print \$2}"); \  echo -en "$res\t"; done; \  echo MB/Sec

filtered by

awk "/rd\twr\trdwr\tcp\tfwr\tfrd\tbzero\tbcopy/ { getline; print \$3 }"


lat_ctx

root@target:~ list="0 4 8 16 32 64" amount="2 4 8 16 24 32 64 96"; \  for size in $list; do lat_ctx -s $size $amount 2>&1; \  done

filtered by

grep -A4 "^\"size=8k" | grep "^16" | cut -f2 -d" "


lat_syscall

root@target:~ list="null read write stat fstat open"; \  cd /tmp; \  dd if=/dev/urandom of=test.pattern bs=1024 count=1024 2>/dev/null; \  for i in $list; do echo -en "$i\t"; done; echo; \  for i in $list; do \  res=$(lat_syscall $i test.pattern 2>&1 | awk "{print \$3}"); \  echo -en "$res\t"; done; echo microseconds

filtered by

awk "/null\tread\twrite\tstat\tfstat\topen/ { getline; print \$6 }"


lat_proc

root@target:~ list="procedure fork exec shell"; \  cp /usr/bin/hello /tmp; \  for i in $list; do echo -en "$i\t"; done ;echo; \  for i in $list; do res=$(lat_proc $i 2>&1 | awk "{FS=\":\"} ; {print \$2}" \  | awk "{print \$1}"); echo -en "$res\t"; done ; echo microseconds

filtered by

awk "/procedure\tfork\texec\tshell/ { getline; print \$2 }"


Thinking about Caches

The influence of Linux caches is not much of an issue, as ensuring cold caches by directly preceding every lmbench invocation by

sync; echo 3 > /proc/sys/vm/drop_caches;

leads to only slightly worse figures (0% to 3%, depending on the type of benchmark), which has been verified on several targets.
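
In practice, this can be wrapped around each invocation roughly like this (a sketch; "cold_run" is just an illustrative helper name):

# Drop page cache, dentries and inodes directly before the benchmark run.
cold_run() {
    sync
    echo 3 > /proc/sys/vm/drop_caches
    "$@"
}

cold_run lat_ops 2>&1 | grep "^float mul:" | cut -f3 -d" "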

GCC Flags

Here is an overview of the compiler variants used, with some of the relevant config switches (as shown by gcc -v) jotted down. The last column shows the multiply instruction found in the output of objdump -d lat_ops.o for do_float_mul, enabling comparison of the code used in float_mul.

CPU         | Compiler                       | Floating Point Code (do_float_mul)
PXA270      | arm-iwmmx-linux-gnueabi-gcc    | __aeabi_fmul
MX27        | arm-v5te-linux-gnueabi-gcc     | __aeabi_fmul
MX35        | arm-1136jfs-linux-gnueabi-gcc  | vmul.f32
OMAP-EVM    | arm-cortexa8-linux-gnueabi-gcc | vmul.f32
OMAP-Beagle | arm-cortexa8-linux-gnueabi-gcc | vmul.f32
Atom Z510   | i586-unknown-linux-gnu-gcc     | fmul
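
The same check can be reproduced by disassembling the object file and looking for the multiply in do_float_mul (a sketch; replace the objdump prefix with the respective toolchain's, and the grep context width is arbitrary):

# Which multiply did the compiler emit for lmbench's do_float_mul?
arm-cortexa8-linux-gnueabi-objdump -d lat_ops.o \
  | grep -A 20 "<do_float_mul>:" \
  | grep -E "vmul\.f32|__aeabi_fmul|fmul"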

Because PTXdist is used as the build system, the gcc flags in effect are uniformly the same across all targets.

With LMbench's bw_mem as an example, the complete compiler command line in its original order is

-DPACKAGE_NAME=\"lmbench\" -DPACKAGE_TARNAME=\"lmbench\" -DPACKAGE_VERSION=\"trunk\" -DPACKAGE_STRING=\"lmbench\ trunk\" -DPACKAGE_BUGREPORT=\"bugs@pengutronix.de\" -DPACKAGE_URL=\"\" -DSTDC_HEADERS=1 -DHAVE_SYS_TYPES_H=1 -DHAVE_SYS_STAT_H=1 -DHAVE_STDLIB_H=1 -DHAVE_STRING_H=1 -DHAVE_MEMORY_H=1 -DHAVE_STRINGS_H=1 -DHAVE_INTTYPES_H=1 -DHAVE_STDINT_H=1 -DHAVE_UNISTD_H=1 -DHAVE_DLFCN_H=1 -DLT_OBJDIR=\".libs/\" -DPACKAGE=\"lmbench\" -DVERSION=\"trunk\" -I. -I../include -I../include -isystem /home/.../sysroot-target/include -isystem /home/.../sysroot-target/usr/include -W -Wall -O2 -DHAVE_uint -DHAVE_uint64_t -DHAVE_int64_t -DHAVE_socklen_t -DHAVE_DRAND48 -DHAVE_RAND -DHAVE_RANDOM -MT bw_mem.o -MD -MP -MF .deps/bw_mem.Tpo -c -o bw_mem.o bw_mem.c

Conclusion

These measurements are probably not completely scientifically correct. The intention was to give us a raw idea of how the systems perform.

We expected the Cortex-A8 to be an order of magnitude faster than the ARM11. This doesn't seem to be the case. Only the memory bandwidth is much higher, while most of the other benchmarks show almost the same values. It is currently totally unclear to us where the performance win we expected from an ARMv7 over an ARMv6 core went.

There seems to be a pattern that, at double the clock frequency, the Atom is often three times faster than the ARM11/Cortex-A8.

Feedback

Do you have any remarks, ideas about the observed effects or other things you might want to tell us? We want to improve this article with the help of the community, so please send us your feedback to our mail address.

Thanks to ...       | for ...
Juergen Beisert     | spelling fixes
Jochen Frieling     | all the measurements
Andreas Gajda       | spelling fixes
Martin Guy          | spelling fixes
Marc Kleine-Budde   | porting all kernels to 2.6.34
Uwe Kleine-Koenig   | spelling fixes
Magnus Lilja        | ideas and suggestions
Nicholas Pitre      | comments about power vs. performance
Baruch Siach        | fix arm cpu types
Colin Tuckley       | ideas and suggestions