A Brief Look at GCC Optimization

Source: Internet. Editor: 程序博客网. Time: 2024/05/17 09:10

My advisor asked me to summarize GCC's optimization options, so here are some excerpts from the GCC manual.

gcc -fopenmp -O2 -o hellomp.out hellomp.c

-o file  Followed by the name of the output file to generate (here, the executable).

Place output in file file. This applies regardless of whatever sort of output is being produced, whether it be an executable file, an object file, an assembler file or preprocessed C code.

If -o is not specified, the default is to put an executable file in a.out, the object file for source.suffix in source.o, its assembler file in source.s, a precompiled header file in source.suffix.gch, and all preprocessed C source on standard output.

-fopenmp  Enable OpenMP.

Enable handling of OpenMP directives "#pragma omp" in C/C++ and "!$omp" in Fortran. When -fopenmp is specified, the compiler generates parallel code according to the OpenMP Application Program Interface v2.5 <http://www.openmp.org/>. This option implies -pthread, and thus is only supported on targets that have support for -pthread.
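As a rough illustration of what -fopenmp acts on, here is a minimal sketch of a function that could live in a file like hellomp.c. The function name sum_squares and the contents of the file are assumptions made for this example; the original does not show the source.

```c
#include <assert.h>

/* Sum of squares 1^2 + ... + n^2, with the loop shared among OpenMP threads.
 * Built without -fopenmp, the pragma is ignored and the loop runs serially. */
long sum_squares(int n)
{
    assert(n >= 0);
    long sum = 0;
    /* reduction(+:sum) gives each thread a private partial sum and
     * adds the partial sums together when the loop ends. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 1; i <= n; i++) {
        sum += (long)i * i;
    }
    return sum;
}
```

Built with the command shown earlier, the iterations are divided among threads; built without -fopenmp, the same result is computed by a single thread.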

Options That Control Optimization

These options control various sorts of optimizations.

Without any optimization option, the compiler's goal is to reduce the cost of compilation and to make debugging produce the expected results. Statements are independent: if you stop the program with a breakpoint between statements, you can then assign a new value to any variable or change the program counter to any other statement in the function and get exactly the results you would expect from the source code.

Turning on optimization flags makes the compiler attempt to improve the performance and/or code size at the expense of compilation time and possibly the ability to debug the program.

The compiler performs optimization based on the knowledge it has of the program. Compiling multiple files at once to a single output file mode allows the compiler to use information gained from all of the files when compiling each of them.

Not all optimizations are controlled directly by a flag. Only optimizations that have a flag are listed.

-O

-O1  Optimize. Optimizing compilation takes somewhat more time, and a lot more memory for a large function.

With -O, the compiler tries to reduce code size and execution time, without performing any optimizations that take a great deal of compilation time.

-O turns on the following optimization flags:

-fauto-inc-dec -fcprop-registers -fdce -fdefer-pop -fdelayed-branch -fdse -fguess-branch-probability -fif-conversion2 -fif-conversion -finline-small-functions -fipa-pure-const -fipa-reference -fmerge-constants -fsplit-wide-types -ftree-builtin-call-dce -ftree-ccp -ftree-ch -ftree-copyrename -ftree-dce -ftree-dominator-opts -ftree-dse -ftree-fre -ftree-sra -ftree-ter -funit-at-a-time

-O also turns on -fomit-frame-pointer on machines where doing so does not interfere with debugging.

-O2  Optimize even more. GCC performs nearly all supported optimizations that do not involve a space-speed tradeoff. As compared to -O, this option increases both compilation time and the performance of the generated code.

-O2 turns on all optimization flags specified by -O. It also turns on the following optimization flags:

-fthread-jumps -falign-functions -falign-jumps -falign-loops -falign-labels -fcaller-saves -fcrossjumping -fcse-follow-jumps -fcse-skip-blocks -fdelete-null-pointer-checks -fexpensive-optimizations -fgcse -fgcse-lm -findirect-inlining -foptimize-sibling-calls -fpeephole2 -fregmove -freorder-blocks -freorder-functions -frerun-cse-after-loop -fsched-interblock -fsched-spec -fschedule-insns -fschedule-insns2 -fstrict-aliasing -fstrict-overflow -ftree-switch-conversion -ftree-pre -ftree-vrp

Please note the warning under -fgcse about invoking -O2 on programs that use computed gotos.

-O3  Optimize yet more. -O3 turns on all optimizations specified by -O2 and also turns on the -finline-functions, -funswitch-loops, -fpredictive-commoning, -fgcse-after-reload, -ftree-vectorize and -fipa-cp-clone options.

-O0  Reduce compilation time and make debugging produce the expected results. This is the default.

-Os  Optimize for size. -Os enables all -O2 optimizations that do not typically increase code size. It also performs further optimizations designed to reduce code size.

-Os disables the following optimization flags:

-falign-functions -falign-jumps -falign-loops -falign-labels -freorder-blocks -freorder-blocks-and-partition -fprefetch-loop-arrays -ftree-vect-loop-version

If you use multiple -O options, with or without level numbers, the last such option is the one that is effective.

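In other words, the -O options are the "Options That Control Optimization". The default is -O0; -O1, -O2 and -O3 each spend more compilation time in exchange for faster generated code, and -Os optimizes for size. Lowercase -o is a different option: it takes the name of the output file. The GCC manual has no lowercase optimization levels, so something like -o2 is still accepted, but it only names the output file "2" and no optimization is performed.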
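One way to check which kind of build a source file actually received is to look at the macros GCC predefines: __OPTIMIZE__ is defined in every optimizing compilation, and __OPTIMIZE_SIZE__ is defined when optimizing for size. The sketch below is only an illustration; the function name describe_build and the message texts are made up for it.

```c
#include <assert.h>
#include <stdio.h>

/* Write a short description of the optimization mode this file was
 * compiled with into buf; returns what snprintf returns. */
int describe_build(char *buf, size_t len)
{
    assert(buf != NULL && len > 0);
#if defined(__OPTIMIZE_SIZE__)
    const char *mode = "optimized for size (for example -Os)";
#elif defined(__OPTIMIZE__)
    const char *mode = "optimized (some -O level other than -O0)";
#else
    const char *mode = "not optimized (-O0 or no -O option)";
#endif
    return snprintf(buf, len, "build: %s", mode);
}
```

Compiling the same file with -O0, -O2 and -Os and printing the buffer should show the three different messages, and a mistyped -o2 shows up as "not optimized".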

An excerpt from the manual's description of -mfpmath=unit:

sse  Use scalar floating point instructions present in the SSE instruction set. This instruction set is supported by Pentium3 and newer chips, in the AMD line by Athlon-4, Athlon-xp and Athlon-mp chips. The earlier version of the SSE instruction set supports only single precision arithmetic, thus double and extended precision arithmetic is still done using 387. A later version, present only in Pentium4 and the future AMD x86-64 chips, supports double precision arithmetic too.

For the i386 compiler, you need to use -march=cpu-type, -msse or -msse2 switches to enable SSE extensions and make this option effective. For the x86-64 compiler, these extensions are enabled by default.

The resulting code should be considerably faster in the majority of cases and avoid the numerical instability problems of 387 code, but may break some existing code that expects temporaries to be 80bit.

This is the default choice for the x86-64 compiler.

sse,387

sse+387

both  Attempt to utilize both instruction sets at once. This effectively doubles the amount of available registers and, on chips with separate execution units for 387 and SSE, the execution resources too. Use this option with care, as it is still experimental, because the GCC register allocator does not model separate functional units well, resulting in unstable performance.
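The remark about 80-bit temporaries can be observed from C through the standard FLT_EVAL_METHOD macro in <float.h>. As far as I know, GCC typically reports 2 when 387 arithmetic is in use and 0 with SSE arithmetic, but treat that mapping as an assumption to verify on your own target. The function name describe_fp_evaluation is invented for this sketch.

```c
#include <assert.h>
#include <float.h>
#include <stdio.h>

/* Describe how intermediate floating-point results are evaluated,
 * based on the C99 macro FLT_EVAL_METHOD. */
int describe_fp_evaluation(char *buf, size_t len)
{
    assert(buf != NULL && len > 0);
    const char *how;
    switch (FLT_EVAL_METHOD) {
    case 0:  how = "each operation is evaluated in its own type"; break;
    case 1:  how = "float and double are evaluated as double"; break;
    case 2:  how = "everything is evaluated as long double"; break;
    default: how = "indeterminable or implementation-defined"; break;
    }
    return snprintf(buf, len, "FLT_EVAL_METHOD=%d: %s",
                    (int)FLT_EVAL_METHOD, how);
}
```

Code that silently depends on the extra precision of case 2 is the kind of code the manual warns may break when switching to -mfpmath=sse.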

A Brief Look at GCC 4.4.4 Optimization

http://www.cnblogs.com/xunxun1982/archive/2010/06/08/1754067.html

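The Intel Compiler enables a number of performance-oriented switches by default, which is one reason it comes out ahead of other compilers run with their default switches. As a cross-platform compiler, though, GCC with well-chosen optimization options can in some respects match the Intel Compiler in run-time performance.
Only GCC's commonly used optimization options are listed below. Some are given without explanation; see the documentation for their meaning.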

1. The -O family

(1) -O and -O1

These include the following options:

-fauto-inc-dec -fcprop-registers -fdce -fdefer-pop -fdelayed-branch -fdse -fguess-branch-probability -fif-conversion2 -fif-conversion -finline-small-functions -fipa-pure-const -fipa-reference -fmerge-constants -fsplit-wide-types -ftree-builtin-call-dce -ftree-ccp -ftree-ch -ftree-copyrename -ftree-dce -ftree-dominator-opts -ftree-dse -ftree-fre -ftree-sra -ftree-ter -funit-at-a-time -fomit-frame-pointer

(2) -O2
In addition to the -O1 options, this also enables
-fthread-jumps -falign-functions -falign-jumps -falign-loops -falign-labels -fcaller-saves -fcrossjumping -fcse-follow-jumps -fcse-skip-blocks -fdelete-null-pointer-checks -fexpensive-optimizations -fgcse -fgcse-lm -findirect-inlining -foptimize-sibling-calls -fpeephole2 -fregmove -freorder-blocks -freorder-functions -frerun-cse-after-loop -fsched-interblock -fsched-spec -fschedule-insns -fschedule-insns2 -fstrict-aliasing -fstrict-overflow -ftree-switch-conversion -ftree-pre -ftree-vrp
(3) -O3
In addition to -O2, this also enables
-finline-functions -funswitch-loops -fpredictive-commoning -fgcse-after-reload -ftree-vectorize
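(4) -Os
Optimize the code for size.
-Os includes the -O2 switches but disables the following ones:
-falign-functions -falign-jumps -falign-loops -falign-labels -freorder-blocks -freorder-blocks-and-partition -fprefetch-loop-arrays -ftree-vect-loop-version
Also, when several -O options are given, the last one is the one that takes effect. For example, with gcc -O1 -Os -O3 -o test test.c the effective optimization level is -O3.
In general, -O3 and -Os are the most commonly used. If the program misbehaves, lower the optimization level, for example from -O3 to -O2 (this is rarely needed).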
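When a program only misbehaves at higher optimization levels, one common cause is code that depends on undefined behavior, for instance pointer casts that break the rules -fstrict-aliasing (part of -O2) lets the compiler assume. The sketch below shows a well-defined way to read the bits of a float. It assumes a 32-bit IEEE 754 float, and the name float_bits is made up for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Return the bit pattern of f as a 32-bit unsigned integer. */
uint32_t float_bits(float f)
{
    /* Assumption for this example: float is a 32-bit type. */
    assert(sizeof(float) == sizeof(uint32_t));
    uint32_t bits;
    /* Writing *(uint32_t *)&f here instead would violate the aliasing
     * rules that -fstrict-aliasing allows the optimizer to rely on.
     * memcpy is well defined and is usually compiled to a plain move. */
    memcpy(&bits, &f, sizeof bits);
    return bits;
}
```

Fixing this kind of code is generally a better long-term answer than lowering the optimization level, although lowering it remains a useful way to confirm the diagnosis.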

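2. Targeting the machine
(1) -march=cpu-type
     Enable the instruction sets needed for the machine that cpu-type refers to.
     cpu-type can be pentium4, core2, athlon-4 and so on (see the documentation for the full list). For example, -march=core2 enables the MMX, SSE, SSE2, SSE3 and SSSE3 instruction sets that core2 supports.
     The native type is also supported: -march=native selects the instruction sets for the CPU of the machine the compiler is running on.
(2) -mfpmath=unit
     Select the floating-point unit.
     unit can be 387 or sse.
    387 is the default on 32-bit x86 and uses the standard 387 floating-point coprocessor.
    sse is the default on x64 and uses the SSE instruction set.
    If your program does a lot of floating-point arithmetic, -mfpmath=sse is generally recommended on P4, K8 and later processors.
(3) Enabling specific instruction sets
     Use switches such as -msse2 and -msse4.1 to enable a specific instruction set.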
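Whether an instruction set was enabled by -march or by a switch such as -msse2 is visible to the source code through predefined macros like __SSE2__. The sketch below uses that macro to pick an SSE2 path with a scalar fallback; the function name add_arrays is an assumption made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#ifdef __SSE2__
#include <emmintrin.h>   /* SSE2 intrinsics */
#endif

/* out[i] = a[i] + b[i] for i in [0, n). */
void add_arrays(const double *a, const double *b, double *out, size_t n)
{
    assert(a != NULL && b != NULL && out != NULL);
    size_t i = 0;
#ifdef __SSE2__
    /* Two doubles per iteration; the unaligned load and store
     * intrinsics work for any pointer alignment. */
    for (; i + 2 <= n; i += 2) {
        __m128d va = _mm_loadu_pd(a + i);
        __m128d vb = _mm_loadu_pd(b + i);
        _mm_storeu_pd(out + i, _mm_add_pd(va, vb));
    }
#endif
    /* Scalar tail, and the whole loop when SSE2 is not enabled. */
    for (; i < n; i++) {
        out[i] = a[i] + b[i];
    }
}
```

In practice the auto-vectorizer at -O3 may generate similar code from the plain loop alone, so hand-written intrinsics like these are optional.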

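3. Other effective options
(1) -ftracer
     Perform tail duplication to enlarge superblock size. This simplifies the function's control flow and lets other optimizations do a better job. It means little on its own but is effective in combination with other optimization options.
(2) -ffast-math
     Break the IEEE/ANSI rules to speed up floating-point computation. This is a dangerous option; consider it only for floating-point-intensive programs that do not need strict IEEE conformance. When precision is not a concern, it makes the code faster.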
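(3) -fivopts
     Perform induction variable optimizations on trees.
(4) -ftree-parallelize-loops=n
     Parallelize loops so that they run in n threads. This applies only to loops with no data dependencies between iterations, and it pays off only on multi-core CPUs.
(5) -ftree-loop-linear
     Perform linear loop transformations on trees. This can improve cache performance and enable further loop optimizations.
(6) -fforce-addr
     Force addresses to be copied into registers before arithmetic is done on them. Since the needed addresses have often already been loaded into registers, this can improve the code.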
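(7) -floop-interchange
     Perform loop interchange, that is, swap the order of nested loops.
    For example,
    DO J = 1, M
      DO I = 1, N
        A(J, I) = A(J, I) * C
      ENDDO
    ENDDO
is transformed into
    DO I = 1, N
      DO J = 1, M
        A(J, I) = A(J, I) * C
      ENDDO
    ENDDO
    After the change, the code is more efficient when N is larger than the cache, because Fortran stores arrays in column-major order. The option is not limited to Fortran; it works for every compiler in the GCC family.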
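(8) -fvisibility=hidden
    Set the default visibility of symbols in the ELF image to hidden. This can substantially improve the linking and load times of shared libraries, produce more optimized code, provide near-perfect API export and prevent symbol clashes. We strongly recommend it when building any shared library (DLL).
      -fvisibility-inlines-hidden
    Hide all inline functions by default, which shrinks the exported symbol table; this reduces file size and also improves run-time performance. Strongly recommended when building any shared library.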
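(9) -minline-all-stringops
    By default GCC inlines string operations only when the destination is known to be aligned to at least a 4-byte boundary. This option enables more inlining and increases the size of the binary, but can improve the performance of programs that depend on fast memcpy, strlen and memset operations.
(10) -m64
    Generate code for a 64-bit environment only; it cannot run in a 32-bit environment. Applies only to x86_64 (including EM64T).
(11) -fprefetch-loop-arrays
    Generate array prefetch instructions, which can speed up programs that work on huge arrays, such as large database-related software. The actual effect depends on the code. Cannot be used together with -Os.
(12) -pipe
    Use pipes rather than temporary files to communicate between the stages of compilation, which can speed up compilation. Recommended.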
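To make the -floop-interchange item above concrete for C, here is a small sketch of the same transformation written by hand. C stores arrays in row-major order, the opposite of Fortran, so the cache-friendly nest is the one whose inner loop runs over the last index. The array sizes and function names are assumptions chosen for the example.

```c
#include <assert.h>
#include <stddef.h>

#define ROWS 3   /* hypothetical sizes, kept tiny for illustration */
#define COLS 4

/* Inner loop walks down a column: consecutive accesses are COLS doubles
 * apart in memory, because C stores a[i][j] row by row. */
void scale_inner_over_rows(double a[ROWS][COLS], double c)
{
    assert(a != NULL);
    for (size_t j = 0; j < COLS; j++)
        for (size_t i = 0; i < ROWS; i++)
            a[i][j] *= c;
}

/* Interchanged nest: the inner loop now walks along a row, so consecutive
 * accesses are adjacent in memory. The computed values are identical. */
void scale_inner_over_cols(double a[ROWS][COLS], double c)
{
    assert(a != NULL);
    for (size_t i = 0; i < ROWS; i++)
        for (size_t j = 0; j < COLS; j++)
            a[i][j] *= c;
}
```

With arrays this small the order makes no measurable difference; the effect only appears once a row or column no longer fits in the cache.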

4. Recommended switches
To sum up, a relatively safe set of switches is
-pipe -O3(-Os) -march=native -mfpmath=sse -msse2 -ftracer -fivopts -ftree-loop-linear -fforce-addr
If high precision is not needed, as in a GUI framework for example, add
-ffast-math
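As an example of code for which -ffast-math is not a safe addition, consider compensated (Kahan) summation. The option permits the compiler to reassociate floating-point expressions, and under that freedom the correction term below may be simplified away, leaving an ordinary sum. This is a minimal sketch; the function name kahan_sum is chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Sum n doubles while tracking the rounding error of each addition. */
double kahan_sum(const double *x, size_t n)
{
    assert(x != NULL || n == 0);
    double sum = 0.0;
    double comp = 0.0;   /* running compensation for lost low-order bits */
    for (size_t i = 0; i < n; i++) {
        double y = x[i] - comp;
        double t = sum + y;
        /* Algebraically (t - sum) - y is zero; in IEEE arithmetic it is
         * the rounding error of sum + y. Reassociation may fold it to 0. */
        comp = (t - sum) - y;
        sum = t;
    }
    return sum;
}
```

Programs built around this kind of numerical care are the ones that should be compiled without -ffast-math.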

If you are building a shared library (.dll, .so), add
-fvisibility=hidden -fvisibility-inlines-hidden
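With -fvisibility=hidden, the functions that form the library's public API have to be marked for export explicitly. A minimal sketch using GCC's visibility attribute follows; the macro MYLIB_API and both function names are hypothetical and stand in for a real library's interface.

```c
#include <assert.h>

#if defined(__GNUC__)
#  define MYLIB_API __attribute__((visibility("default")))
#else
#  define MYLIB_API   /* other compilers: leave visibility unchanged */
#endif

/* Internal helper. When the file is compiled with -fvisibility=hidden
 * it is left out of the shared library's exported symbols. */
int mylib_clamp(int v, int lo, int hi)
{
    assert(lo <= hi);
    if (v < lo) return lo;
    if (v > hi) return hi;
    return v;
}

/* Public entry point: stays exported despite -fvisibility=hidden. */
MYLIB_API int mylib_clamped_add(int a, int b, int lo, int hi)
{
    return mylib_clamp(a + b, lo, hi);
}
```

After building a shared object from this file with -fvisibility=hidden, a symbol listing of the library should show only mylib_clamped_add among the exported functions.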

0 0