GCC Compiler Intrinsics


http://www.linuxjournal.com/content/introduction-gcc-compiler-intrinsics-vector-processing?page=0,1

http://stackoverflow.com/questions/7156908/sse-intrinsic-functions-reference

Table 1. GCC Command-Line Options to Generate SIMD Code

Processor             Options
X86/MMX/SSE1/SSE2     -mfpmath=sse -mmmx -msse -msse2
ARM Neon              -mfpu=neon -mfloat-abi=softfp
Freescale Altivec     -maltivec -mabi=altivec

Here are the include files you need:

arm_neon.h - ARM Neon types & intrinsics
altivec.h - Freescale Altivec types & intrinsics
mmintrin.h - X86 MMX
xmmintrin.h - X86 SSE1
emmintrin.h - X86 SSE2
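
As a rough sketch of how these headers pair with the options in Table 1, the fragment below selects one header at compile time from the macros GCC predefines when the matching option is enabled. The function name simd_target and the SIMD_TARGET macro are hypothetical and exist only to make the selection visible; the article's own samples may organize this differently.

```c
/* Select the intrinsics header that matches the instruction set GCC was
 * asked to target. The __SSE2__, __SSE__, __MMX__, __ARM_NEON__ and
 * __ALTIVEC__ macros are predefined by GCC when the option is active. */
#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 (GCC's header also pulls in SSE1 and MMX) */
#define SIMD_TARGET "SSE2"
#elif defined(__SSE__)
#include <xmmintrin.h>   /* SSE1 */
#define SIMD_TARGET "SSE1"
#elif defined(__MMX__)
#include <mmintrin.h>    /* MMX */
#define SIMD_TARGET "MMX"
#elif defined(__ARM_NEON__) || defined(__ARM_NEON)
#include <arm_neon.h>    /* ARM Neon */
#define SIMD_TARGET "Neon"
#elif defined(__ALTIVEC__)
#include <altivec.h>     /* Altivec */
#define SIMD_TARGET "Altivec"
#else
#define SIMD_TARGET "scalar"
#endif

/* Hypothetical helper: reports which code path this file was built for. */
const char *simd_target(void)
{
    return SIMD_TARGET;
}
```

Building the same file with and without the options from Table 1 and printing simd_target() is a quick way to confirm that the flags reached the compiler.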

X86: MMX, SSE, SSE2 Types and Debugging

X86-compatible processors with MMX, SSE1 and SSE2 provide the following types:

MMX: __m64 - 64 bits of integers, divided into eight 8-bit integers, four 16-bit shorts or two 32-bit integers.
SSE1: __m128 - 128 bits: four single-precision floats.
SSE2: __m128i - 128 bits of packed integers of any size; __m128d - 128 bits: two doubles.
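
To make the three 128-bit types concrete, here is a minimal sketch that adds one vector of each kind. The function names are hypothetical. The unaligned load and store intrinsics (_mm_loadu_ps, _mm_loadu_si128, _mm_loadu_pd and their store counterparts) are assumptions chosen so that callers need not align their arrays. A scalar fallback is included so the file still builds where SSE2 is not enabled.

```c
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h>

/* __m128: four single-precision floats added in one operation. */
void add4_floats(const float *a, const float *b, float *out)
{
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb));
}

/* __m128i: packed integers, treated here as eight 16-bit lanes. */
void add8_shorts(const int16_t *a, const int16_t *b, int16_t *out)
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_add_epi16(va, vb));
}

/* __m128d: two doubles. */
void add2_doubles(const double *a, const double *b, double *out)
{
    __m128d va = _mm_loadu_pd(a);
    __m128d vb = _mm_loadu_pd(b);
    _mm_storeu_pd(out, _mm_add_pd(va, vb));
}

#else  /* scalar fallback with the same interface */

void add4_floats(const float *a, const float *b, float *out)
{
    for (int i = 0; i < 4; i++) out[i] = a[i] + b[i];
}

void add8_shorts(const int16_t *a, const int16_t *b, int16_t *out)
{
    for (int i = 0; i < 8; i++) out[i] = (int16_t)(a[i] + b[i]);
}

void add2_doubles(const double *a, const double *b, double *out)
{
    for (int i = 0; i < 2; i++) out[i] = a[i] + b[i];
}

#endif
```

The __m64 type is left out of this sketch because MMX code shares registers with the x87 floating-point unit and needs _mm_empty before floating-point code resumes.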

Table 2. Subset of vector operators and intrinsics used in the examples.

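Operation: loading vector
Altivec: vec_ld, vec_splat, vec_splat_s16, vec_splat_s32, vec_splat_s8, vec_splat_u16, vec_splat_u32, vec_splat_u8
Neon: vld1q_f32, vld1q_s16, vsetq_lane_f32, vld1_u8, vdupq_lane_s16, vdupq_n_s16, vmovq_n_f32, vset_lane_u8
MMX/SSE/SSE2: _mm_set_epi16, _mm_set1_epi16, _mm_set1_pi16, _mm_set_pi16, _mm_load_ps, _mm_set1_ps, _mm_loadh_pi, _mm_loadl_pi

Operation: storing vector
Altivec: vec_st
Neon: vst1_u8, vst1q_s16, vst1q_f32, vst1_s16
MMX/SSE/SSE2: _mm_store_ps

Operation: add
Altivec: vec_madd, vec_mladd, vec_adds
Neon: vaddq_s16, vaddq_f32, vmlaq_n_f32
MMX/SSE/SSE2: _mm_add_epi16, _mm_add_pi16, _mm_add_ps

Operation: subtract
Altivec: vec_sub
Neon: vsubq_s16
MMX/SSE/SSE2: (none)

Operation: multiply
Altivec: vec_madd, vec_mladd
Neon: vmulq_n_s16, vmulq_s16, vmulq_f32, vmlaq_n_f32
MMX/SSE/SSE2: _mm_mullo_epi16, _mm_mullo_pi16, _mm_mul_ps

Operation: arithmetic shift
Altivec: vec_sra, vec_srl, vec_sr
Neon: vshrq_n_s16
MMX/SSE/SSE2: _mm_srai_epi16, _mm_srai_pi16

Operation: byte permutation
Altivec: vec_perm, vec_sel, vec_mergeh, vec_mergel
Neon: vtbl1_u8, vtbx1_u8, vget_high_s16, vget_low_s16, vdupq_lane_s16, vdupq_n_s16, vmovq_n_f32, vbsl_u8
MMX/SSE/SSE2: _mm_shuffle_pi16, _mm_shuffle_ps

Operation: type conversion
Altivec: vec_cts, vec_unpackh, vec_unpackl, vec_ctu
Neon: vmovl_u8, vreinterpretq_s16_u16, vcvtq_u32_f32, vqmovn_s32, vqmovun_s16, vqmovn_u16, vcvtq_f32_s32, vmovl_s16, vmovq_n_f32
MMX/SSE/SSE2: _mm_packs_pu16, _mm_cvtps_pi16, _mm_packus_epi16

Operation: vector combination
Altivec: vec_pack, vec_packsu
Neon: vcombine_u16, vcombine_u8, vcombine_s16
MMX/SSE/SSE2: (none)

Operation: maximum
Altivec: (none)
Neon: (none)
MMX/SSE/SSE2: _mm_max_ps

Operation: minimum
Altivec: (none)
Neon: (none)
MMX/SSE/SSE2: _mm_min_ps

Operation: vector logic
Altivec: (none)
Neon: (none)
MMX/SSE/SSE2: _mm_andnot_ps, _mm_and_ps, _mm_or_ps

Operation: rounding
Altivec: vec_trunc
Neon: (none)
MMX/SSE/SSE2: (none)

Operation: misc
Altivec: (none)
Neon: (none)
MMX/SSE/SSE2: _mm_empty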
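
A few of the MMX/SSE/SSE2 entries in Table 2 can be combined into a small fixed-point routine. The sketch below computes (a[i] * gain + b[i]) >> 2 over eight 16-bit lanes using _mm_set1_epi16, _mm_mullo_epi16, _mm_add_epi16 and _mm_srai_epi16. The function name scale_add_shift8 is hypothetical, and the unaligned load and store intrinsics are not in the table; they are an assumption made so that the arrays need no special alignment.

```c
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* out[i] = (a[i] * gain + b[i]) >> 2 for eight 16-bit lanes.
 * Intermediate results wrap to 16 bits, as they do in the SSE2 lanes. */
void scale_add_shift8(const int16_t *a, const int16_t *b,
                      int16_t gain, int16_t *out)
{
#if defined(__SSE2__)
    __m128i va   = _mm_loadu_si128((const __m128i *)a);
    __m128i vb   = _mm_loadu_si128((const __m128i *)b);
    __m128i vg   = _mm_set1_epi16(gain);        /* broadcast gain to all lanes */
    __m128i prod = _mm_mullo_epi16(va, vg);     /* low 16 bits of each product */
    __m128i sum  = _mm_add_epi16(prod, vb);     /* wrapping 16-bit add */
    _mm_storeu_si128((__m128i *)out, _mm_srai_epi16(sum, 2)); /* arithmetic >> 2 */
#else
    /* Scalar fallback. Narrowing to int16_t and right-shifting a negative
     * value are implementation-defined in C; GCC wraps and shifts
     * arithmetically, which matches the vector path. */
    for (int i = 0; i < 8; i++) {
        int16_t prod = (int16_t)(a[i] * gain);
        int16_t sum  = (int16_t)(prod + b[i]);
        out[i] = (int16_t)(sum >> 2);
    }
#endif
}
```

Keeping a scalar branch with the same interface gives you a reference to test the vector path against, for both correctness and speed.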

Check Processor at Runtime

Next, your code should check the processor at runtime to see whether it supports the vector instructions you compiled for. If you don't have a vector code path for that processor, fall back to your scalar code. If you have vector support, and the vector path is faster, use it. On X86, test processor features with the cpuid instruction, which GCC exposes through <cpuid.h>. (You saw examples of that in samples/simple/x86/*c.) We couldn't find anything that well established for Altivec and Neon, so the examples there parse /proc/cpuinfo. (Serious code might instead execute a test SIMD instruction: if the processor raises a SIGILL signal when it encounters that instruction, the feature is not available.)
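
The X86 check described above might look roughly like the following. This is a minimal sketch, not the article's sample code: it uses __get_cpuid and the bit_SSE mask from GCC's <cpuid.h>, and the names add_fn, add_scalar, add_sse, cpu_has_sse and select_add are hypothetical. On non-X86 targets, or when SSE is not enabled at compile time, it always falls back to the scalar routine.

```c
#include <stddef.h>

#if (defined(__i386__) || defined(__x86_64__)) && defined(__SSE__)
#include <cpuid.h>
#include <xmmintrin.h>
#define HAVE_SSE_PATH 1
#endif

typedef void (*add_fn)(const float *, const float *, float *, size_t);

/* Scalar code path: always available. */
static void add_scalar(const float *a, const float *b, float *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] + b[i];
}

#ifdef HAVE_SSE_PATH
/* Vector code path: four floats per step, scalar loop for the tail. */
static void add_sse(const float *a, const float *b, float *out, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
        _mm_storeu_ps(out + i,
                      _mm_add_ps(_mm_loadu_ps(a + i), _mm_loadu_ps(b + i)));
    for (; i < n; i++)
        out[i] = a[i] + b[i];
}

/* Ask cpuid leaf 1 whether the processor reports SSE (a bit in EDX). */
static int cpu_has_sse(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;               /* leaf 1 not supported */
    return (edx & bit_SSE) != 0;
}
#endif

/* Choose the code path once, at runtime. */
add_fn select_add(void)
{
#ifdef HAVE_SSE_PATH
    if (cpu_has_sse())
        return add_sse;
#endif
    return add_scalar;
}
```

A caller would store the result of select_add() once and call through the pointer afterwards. One caveat: when the whole file is compiled with -msse, the compiler may use SSE instructions in the scalar path too, so real projects usually put the vector path in a separately compiled file.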

Summary

In summary, GCC offers intrinsics that allow you to get more from your processor without the work of going all the way to assembly. We have covered basic types and some of the vector math functions. When you use intrinsics, make sure you test thoroughly. Test for speed and correctness against a scalar version of your code. Because processors differ in the features they offer and in how well those features perform, this is a wide-open field. The more effort you put into it, the more you will get out.
