Performance Optimization

Source: Internet; Published by: 北京学游戏编程; Editor: 程序博客网; Time: 2024/06/03 13:09

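1 Choosing a profiling tool

Available profiling tools include gprof, gperftools, OProfile, and Intel VTune Amplifier. gprof is part of the GNU toolchain and is enabled by compiling with -pg; its drawback is that it cannot profile dynamic libraries. OProfile is free, usually ships with Linux, and hooks into the kernel; its drawback is that it cannot run on a virtual machine. gperftools comes from Google and mainly provides high-performance memory management; CPU profiling is only one of its four main features, and its drawback is that the gperftools library has to be linked into the program. Intel VTune Amplifier is commercial software; as a properly run software company, we will not consider it until a license has been purchased. For how these tools work internally, see https://www.cnblogs.com/likwo/archive/2012/12/20/2826988.html.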

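Comparison of C++ profilers ("TBD" means not yet investigated):

GNU gprof
  Accuracy: Fairly high. Function call counts are 100% exact, but execution times are estimated from the sampling frequency and carry some error.
  Dynamic library support: Not supported.
  Runtime control: Decided at compile time; inflexible.
  Customization and maintenance cost: The code is integrated into glibc, so changes have a wide impact and are hard to release.
  Virtual machine support: Supported.
  GUI: Poor.
  Multithreading support: Not supported.

Google performance tools (gperftools)
  Accuracy: Moderate. Both call counts and execution times are estimated from the sampling frequency, so there is some error and some samples are missed.
  Dynamic library support: Supported.
  Runtime control: Controlled at run time; more convenient to operate.
  Customization and maintenance cost: An independent, open-source third-party library; customization and maintenance costs are low.
  Virtual machine support: Supported.
  GUI: Good.
  Multithreading support: Supported (requires Linux 2.6 or later).

OProfile
  Accuracy: TBD.
  Dynamic library support: Supported.
  Runtime control: TBD.
  Customization and maintenance cost: TBD.
  Virtual machine support: Not supported (requires running echo "options oprofile timer=1" >> /etc/modprobe.conf and then rebooting the virtual machine).
  GUI: Poor.
  Multithreading support: TBD.

Intel VTune Amplifier
  Accuracy: TBD.
  Dynamic library support: TBD.
  Runtime control: TBD.
  Customization and maintenance cost: TBD.
  Virtual machine support: TBD.
  GUI: Good.
  Multithreading support: TBD.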

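Our project uses a large number of dynamic libraries and runs on a virtual machine, so we chose gperftools.

2 Installing the profiling tool

2.1 Download

gperftools is open source; the source code is at https://github.com/gperftools/gperftools. You can fetch it with git or download a package from the releases page. Downloading from the releases page is recommended, because the package already contains the configure script, which generates the Makefile automatically. If you take the raw source tree instead, generate the configure script with automake and autoconf.

2.2 Installation

Install with the following commands:

    ./configure
    make
    sudo make install

The libraries are installed to /usr/local/lib by default.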

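2.2.1 Installing on a 64-bit system

During installation you may see the message: configure: WARNING: No frame pointers and no libunwind. Using experimental backtrace capturing via libgcc. Expect crashy cpu profiler. This means libunwind is not installed. The INSTALL file shipped with gperftools explains that 64-bit systems need it. Search for it with sudo apt-cache search libunwind, then install the package you need.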

    $ sudo apt-cache search libunwind
    libunwind-setjmp0 - libunwind-based non local goto - runtime
    libunwind-setjmp0-dbg - libunwind-based non local goto - runtime
    libunwind-setjmp0-dev - libunwind-based non local goto - development
    libunwind8 - library to determine the call-chain of a program - runtime
    libunwind8-dbg - library to determine the call-chain of a program - runtime
    libunwind8-dev - library to determine the call-chain of a program - development
    $ sudo apt-get install libunwind8-dev

2.2.2 Building the 32-bit library on a 64-bit operating system

    linux32 ./configure "CFLAGS=-m32 -g -O2" "CXXFLAGS=-m32 -g -O2" LDFLAGS=-m32
    make
    sudo make install

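3 Usage

3.1 How to use it

There are three ways to use gperftools:

3.1.1 Calling the API directly

This approach suits cases where only one part of the program needs to be analyzed: call the API around that part.
Method: call ProfilerStart() and ProfilerStop(). The two calls can be wrapped in signal handlers, so that sending one signal to the process under test starts profiling and sending a second signal stops it. This suits server programs: a server does not exit on its own, and stopping it with Ctrl+C is not a normal exit(0), so the collected profile data would be incomplete or even empty.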

    #include <stdio.h>
    #include <sys/types.h>
    #include <unistd.h>
    #include <signal.h>
    #include <gperftools/profiler.h>

    // SIGUSR1: start profiling
    // SIGUSR2: stop profiling
    // Note: printf is not async-signal-safe; it is used here only as a debugging aid.
    static void gprof_callback(int signum)
    {
        if (signum == SIGUSR1)
        {
            printf("Catch the signal ProfilerStart\n");
            ProfilerStart("li.prof");
        }
        else if (signum == SIGUSR2)
        {
            printf("Catch the signal ProfilerStop\n");
            ProfilerStop();
        }
    }

    // Install gprof_callback as the handler for SIGUSR1 and SIGUSR2.
    static void setup_signal()
    {
        struct sigaction profstat;
        profstat.sa_handler = gprof_callback;
        profstat.sa_flags = 0;
        // Block both signals while the handler is running.
        sigemptyset(&profstat.sa_mask);
        sigaddset(&profstat.sa_mask, SIGUSR1);
        sigaddset(&profstat.sa_mask, SIGUSR2);

        if (sigaction(SIGUSR1, &profstat, NULL) < 0)
        {
            fprintf(stderr, "Fail to connect signal SIGUSR1 with start profiling\n");
        }
        if (sigaction(SIGUSR2, &profstat, NULL) < 0)
        {
            fprintf(stderr, "Fail to connect signal SIGUSR2 with stop profiling\n");
        }
    }

After starting the program, use kill -s SIGUSR1 pid and kill -s SIGUSR2 pid to start and stop collection.

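3.1.1.1 Adjusting the sampling interval

The default sampling interval is 10 ms (100 interrupts per second). It can be changed through the CPUPROFILE_FREQUENCY environment variable, whose unit is interrupts/second. For example, to sample once every 1 ms, set the variable when launching the program from the shell:

    env CPUPROFILE_FREQUENCY=1000 <binary>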
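
As a rough illustration of the arithmetic behind CPUPROFILE_FREQUENCY, the C sketch below converts a frequency into a sampling interval and a sample count into approximate CPU time. The function names sample_interval_ms and samples_to_seconds are made up for this example and are not part of gperftools. The conversion assumes each sample stands for exactly one interval, which is only an approximation.

```c
#include <assert.h>

/* Sampling interval in milliseconds for a given CPUPROFILE_FREQUENCY
 * value (interrupts per second). Hypothetical helper, not a gperftools API. */
double sample_interval_ms(int frequency_hz)
{
    assert(frequency_hz > 0);
    return 1000.0 / frequency_hz;
}

/* Approximate CPU time, in seconds, represented by a number of samples,
 * assuming one sample corresponds to one sampling interval. */
double samples_to_seconds(long samples, int frequency_hz)
{
    assert(frequency_hz > 0);
    return (double)samples / frequency_hz;
}
```

With the default of 100 interrupts per second, sample_interval_ms(100) gives 10 ms, and a function that collected 500 samples has used roughly 5 seconds of CPU time.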

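3.1.2 Linking the static library

This is the most common approach. It is typically used for programs that run for a while and then exit on their own.
Method: add -lprofiler at link time.
Run the program: env CPUPROFILE=./helloworld.prof ./helloworld. This profiles the program helloworld and writes the results to ./helloworld.prof.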

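3.1.3 Linking the dynamic library

This approach is much the same as the static one.
Method: add -lprofiler at link time. The dynamic library is used by default, unless -Wl,-Bstatic is given at link time to select the static library.
At run time, use LD_PRELOAD, e.g. % env LD_PRELOAD="/usr/lib/libprofiler.so" <binary>.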

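3.2 Viewing the results

3.2.1 Result file location and analysis tool

By default the result file is written to the directory of the program under test, with the name li.prof. To change this, edit the output path and file name in the code given in 3.1.1.
To view the profile, use pprof. It is a Perl script that presents the gperftools output in a more readable form and can export images, PDF, and other formats. pprof is installed together with gperftools; if the target machine does not have it, copy /usr/local/bin/pprof to the target machine.
Run pprof on the data file: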

    % pprof "program" "profile"
      Generates one line per procedure

    % pprof --gv "program" "profile"
      Generates annotated call-graph and displays via "gv"

    % pprof --gv --focus=Mutex "program" "profile"
      Restrict to code paths that involve an entry that matches "Mutex"

    % pprof --gv --focus=Mutex --ignore=string "program" "profile"
      Restrict to code paths that involve an entry that matches "Mutex"
      and does not match "string"

    % pprof --list=IBF_CheckDocid "program" "profile"
      Generates disassembly listing of all routines with at least one
      sample that match the --list= pattern.  The listing is
      annotated with the flat and cumulative sample counts at each line.

    % pprof --disasm=IBF_CheckDocid "program" "profile"
      Generates disassembly listing of all routines with at least one
      sample that match the --disasm= pattern.  The listing is
      annotated with the flat and cumulative sample counts at each PC value.

For more detail, see the gperftools documentation at http://goog-perftools.sourceforge.net/doc/cpu_profiler.html.

3.2.2 Text view

First generate a readable text report with pprof:

    % pprof --text "program" "profile"

A report produced with pprof --text looks like this:

    Total: 492 samples
         180  36.6%  36.6%      406  82.5% dpdk_send_packets
         163  33.1%  69.7%      163  33.1% eth_igb_xstats_get
          63  12.8%  82.5%       63  12.8% eth_igbvf_xstats_get
          52  10.6%  93.1%       52  10.6% e1000_write_phy_reg_bm2
          14   2.8%  95.9%       14   2.8% 0xf76fbc90
          10   2.0%  98.0%       10   2.0% dpdk_mcore_bcmc_loop

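The columns have the following meanings:

Column 1: number of samples in the function itself (excluding calls to other functions)
Column 2: percentage of samples in the function itself (excluding calls to other functions)
Column 3: running total of the percentage of samples so far (excluding calls to other functions)
Column 4: number of samples in the function and everything it calls
Column 5: percentage of samples in the function and everything it calls
Column 6: function name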
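
To make the column layout concrete, here is a minimal C sketch that splits one data row of a pprof --text report into the six columns described above. The struct pprof_row and the function parse_pprof_row are hypothetical names invented for this example; they are not part of gperftools, and the parser assumes rows look like the sample report shown earlier.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* One data row of a `pprof --text` report (hypothetical helper type). */
struct pprof_row {
    long   flat_samples;  /* column 1: samples in the function itself   */
    double flat_pct;      /* column 2: percentage for column 1          */
    double running_pct;   /* column 3: running total of column 2        */
    long   cum_samples;   /* column 4: samples including callees        */
    double cum_pct;       /* column 5: percentage for column 4          */
    char   name[256];     /* column 6: function name                    */
};

/* Parse one report line. Returns 1 for a data row, 0 for anything else
 * (for example the "Total: N samples" header). */
int parse_pprof_row(const char *line, struct pprof_row *row)
{
    assert(line != NULL && row != NULL);

    int fields = sscanf(line, " %ld %lf%% %lf%% %ld %lf%% %255[^\n]",
                        &row->flat_samples, &row->flat_pct, &row->running_pct,
                        &row->cum_samples, &row->cum_pct, row->name);
    if (fields != 6)
        return 0;

    /* Drop trailing spaces or a carriage return left after the name. */
    size_t len = strlen(row->name);
    while (len > 0 && (row->name[len - 1] == ' ' || row->name[len - 1] == '\r'))
        row->name[--len] = '\0';
    return 1;
}
```

Fed the first data row of the sample report, the parser fills in 180 self samples and 406 cumulative samples for dpdk_send_packets. The gap between the two numbers is the time spent in functions that dpdk_send_packets calls.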

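3.2.3 Graphical view (recommended)

First convert the profile into a callgrind file with pprof:

    % pprof --callgrind "program" "profile" > callgrind.res

Several tools can display the result graphically:
1. KCachegrind: Linux software that displays the generated graph directly.
2. windows port of kcachegrind: the Linux KCachegrind rebuilt as a Windows executable, with the same features as the Linux version.
3. WinCacheGrind: a simplified KCachegrind for Windows that can analyze the cachegrind.xxx files produced by xdebug.
4. Webgrind: a web-based callgrind viewer; combined with xdebug it can profile PHP scripts online in real time.
The graphs they produce are much the same, so the tools themselves are not covered here; see each tool's official documentation. The result in the windows port of kcachegrind is shown in the figure below:
[Figure: windows port of kcachegrind interface]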

4 Worked example

1. Add a new source file to the project containing the code from 3.1.1, and call setup_signal() from the initialization code.

    void li_common_init()
    {
        setup_signal();
        return;
    }

2. Add -lprofiler to the link line in the Makefile that builds the program.

    $(usr_app):
    	$(CXX) -m32 -rdynamic -Wl,-rpath,./ $(LIBPATH) $^ -o $(top_dir)/../lib/bin/$@ -lappl -lchip -lbfd -lpmalm -lvirtualboard -lXelApp -lnetdev_fwd -ltestsim -lfpgasim -ltestcommon -llicom -lprofiler -Wl,--whole-archive -Wl,-ldpdk -Wl,--no-whole-archive -lbmu -lDBEPR820x86 -lm -lrt -ldl  -lpthread -lutil -laps -lcrypto #-lncurses

    $(usr_app2):
    	$(CXX) -m32 -rdynamic -Wl,-rpath,./ $(LIBPATH) $^ -o $(top_dir)/../lib/bin/$@ -lappl -lchip -lbfd -lpmalm -lvirtualboard -lXelApp -lnetdev_fwd-atom -ltestsim -lfpgasim -ltestcommon -llicom -lprofiler -Wl,--whole-archive -Wl,-ldpdk-atom -Wl,--no-whole-archive -lbmu -lDBEPR820x86 -lm -lrt -ldl  -lpthread -lutil -laps -lcrypto

Copy libprofiler.so.0.4.14 to the bin/lib directory, then create the symbolic links in that directory with the following commands:

    sudo ln -s libprofiler.so.0.4.14 libprofiler.so.0
    sudo ln -s libprofiler.so.0 libprofiler.so

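Run make to build the full package and upgrade the device with it. In this example the executable is named XCU_R820.out.
3. Run the program built in the previous step and look up its PID: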

    user@ubuntu:~$ ps aux |grep out
    root      2973  296  2.1 1922348 353152 ?      Sl   15:47  11:24 XCU_R820.out
    user      3145  0.0  0.0  11764  2152 pts/2    S+   15:51   0:00 grep --color=auto out

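At the point you want to measure, enter the following in the shell. The first command starts profiling and the second stops it: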

    user@ubuntu:~$ sudo kill -s SIGUSR1 2973
    user@ubuntu:~$ sudo kill -s SIGUSR2 2973

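The profile file li.prof has now been generated in the program's directory.
4. Copy the pprof tool to the /usr/local/bin/ directory on the device.
5. Enter the following command in the shell to generate a readable report.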

    root@ubuntu:~# pprof --text /home/xcu/lib/bin/XCU_R820.out /home/xcu/lib/bin/li.prof
    Using local file /home/xcu/lib/bin/XCU_R820.out.
    Using local file /home/xcu/lib/bin/li.prof.
    Total: 4187 samples
        1379  32.9%  32.9%     3367  80.4% rte_memcpy (inline)
         888  21.2%  54.1%      888  21.2% e1000_setup_copper_link_ich8lan
         790  18.9%  73.0%      790  18.9% ixgbe_check_mac_link_82598
         514  12.3%  85.3%      514  12.3% e1000_phy_is_accessible_pchlan
         469  11.2%  96.5%      469  11.2% e1000_setup_link_ich8lan
         111   2.7%  99.1%      115   2.7% _mm_storeu_si128 (inline)
           8   0.2%  99.3%        8   0.2% 0xf7fdbc90
           2   0.0%  99.4%        2   0.0% CheckFifoSpace
           2   0.0%  99.4%        2   0.0% __nss_hosts_lookup
           2   0.0%  99.5%        2   0.0% __pthread_cond_signal
           2   0.0%  99.5%        2   0.0% _mm_loadu_si128 (inline)
           2   0.0%  99.6%        2   0.0% osa_semGive_x
           1   0.0%  99.6%        1   0.0% 0xf7fdbc8e
           1   0.0%  99.6%        1   0.0% ApsModTeFrrConf
           1   0.0%  99.6%        1   0.0% BmuLog
           1   0.0%  99.7%        1   0.0% BmuTaskDiag
           1   0.0%  99.7%        1   0.0% GetCheckSum
           1   0.0%  99.7%        1   0.0% _ApiSearchIntfVlanrange
           1   0.0%  99.7%        1   0.0% _ApiSearchNve
           1   0.0%  99.8%        1   0.0% __isoc99_vfwscanf
           1   0.0%  99.8%        2   0.0% app_clk_handler
           1   0.0%  99.8%        1   0.0% fhdrv_psn_get_vxlan_tx_counter
           1   0.0%  99.8%        1   0.0% ixgbe_get_media_type_82598
           1   0.0%  99.9%        2   0.0% li_log_file
           1   0.0%  99.9%        1   0.0% line_card_state_timer
           1   0.0%  99.9%        1   0.0% oam_fpga_16bits_read
           1   0.0%  99.9%        1   0.0% osa_taskDelay
           1   0.0% 100.0%        1   0.0% osa_taskDelayInSys
           1   0.0% 100.0%        1   0.0% osa_taskLockCancel
           1   0.0% 100.0%        1   0.0% tzset
           0   0.0% 100.0%        2   0.0% 0xe85d781f
           0   0.0% 100.0%        1   0.0% AnalyzeFrame
           0   0.0% 100.0%        1   0.0% ApiAddIntf
           0   0.0% 100.0%        3   0.1% BmuExceptionTask
           0   0.0% 100.0%        2   0.0% BmuProcessTimerList
           0   0.0% 100.0%      803  19.2% BmuPthreadBoot
           0   0.0% 100.0%        3   0.1% DivConfProcess
           0   0.0% 100.0%        2   0.0% ExecConfig
           0   0.0% 100.0%        1   0.0% HighGather
           0   0.0% 100.0%        1   0.0% HighGatherEntry
           0   0.0% 100.0%        2   0.0% LowGather
           0   0.0% 100.0%        2   0.0% LowGatherEntry
           0   0.0% 100.0%        2   0.0% ProcCmd
           0   0.0% 100.0%        3   0.1% ProcCmdEntry
           0   0.0% 100.0%        2   0.0% ProcUsrCmd
           0   0.0% 100.0%        1   0.0% UnCompressCmdData
           0   0.0% 100.0%     3363  80.3% X86ReadMeterTbl
           0   0.0% 100.0%        4   0.1% X86WriteARPMissTbl
           0   0.0% 100.0%     3355  80.1% X86WriteMacHashTbl
           0   0.0% 100.0%        1   0.0% _ApiAddIntf
           0   0.0% 100.0%       16   0.4% __mempool_generic_put (inline)
           0   0.0% 100.0%       16   0.4% __rte_mbuf_raw_free (inline)
           0   0.0% 100.0%      790  18.9% bfd_fsm_state_up
           0   0.0% 100.0%      790  18.9% bfdtask
           0   0.0% 100.0%      808  19.3% clone
           0   0.0% 100.0%        1   0.0% del_bfd_bind_info_ip
           0   0.0% 100.0%      790  18.9% dpdk_mcore_rcv_minm_loop
           0   0.0% 100.0%        1   0.0% fhdrv_psn_get_vxlan_rx_counter
           0   0.0% 100.0%      790  18.9% finish_change_fpga_bfd_cfg_by_index
           0   0.0% 100.0%      790  18.9% finish_fpga_bfd_cfg_by_index
           0   0.0% 100.0%        2   0.0% fpgasim_find_bfd
           0   0.0% 100.0%        3   0.1% fpgasim_receive_bfd_packet
           0   0.0% 100.0%        3   0.1% fpgasim_recv_bfd_packet
           0   0.0% 100.0%        3   0.1% fpgasim_send_bfd_packet
           0   0.0% 100.0%        3   0.1% fpgasim_send_packet_to_tunnel
           0   0.0% 100.0%      790  18.9% fsm
           0   0.0% 100.0%        1   0.0% get_fpga_bfd_bfd_discr
           0   0.0% 100.0%        3   0.1% intf_cfg_process
           0   0.0% 100.0%        3   0.1% ixgbe_setup_mac_link_82598
           0   0.0% 100.0%        2   0.0% li_hash_get
           0   0.0% 100.0%        1   0.0% read_fpga_mmap
           0   0.0% 100.0%       16   0.4% rte_mempool_generic_put (inline)
           0   0.0% 100.0%       16   0.4% rte_mempool_put (inline)
           0   0.0% 100.0%       16   0.4% rte_mempool_put_bulk (inline)
           0   0.0% 100.0%       16   0.4% rte_pktmbuf_detach
           0   0.0% 100.0%        1   0.0% spm_o

6. Generate the graph file from the shell.

    pprof --callgrind /home/xcu/lib/bin/XCU_R820.out /home/xcu/lib/bin/li.prof >callgrind.res

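Open the file on Windows with the windows port of kcachegrind; the result is shown below. Then make changes according to the specific problem found.
[Figure: callgrind.res opened in the windows port of kcachegrind]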

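7. Performance optimization
The profiling results show that e1000_setup_copper_link_ich8lan takes the most time. Its main job is to read the port status. The fix is to call this interface periodically from a timer and keep the result in memory; the places that used to call e1000_setup_copper_link_ich8lan now read the data from memory instead.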
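
The caching idea described above might look like the following C sketch. All names here (slow_query_link_status, link_status_refresh, link_status_get, MAX_PORTS) are hypothetical stand-ins, not the project's real code. The expensive driver call is simulated, and the periodic timer that would call link_status_refresh is left out. C11 atomics are assumed so that the timer thread and the reading threads can share the cache safely.

```c
#include <assert.h>
#include <stdatomic.h>

#define MAX_PORTS 8  /* assumed port count, for illustration only */

/* Counts how often the expensive query runs, so the effect is visible. */
static int slow_query_count = 0;

/* Hypothetical stand-in for the expensive driver call that reads the
 * port status from hardware. Here odd ports simply report "up". */
static int slow_query_link_status(int port)
{
    slow_query_count++;
    return port % 2;
}

/* Cached status per port, written by the timer and read on the hot path. */
static atomic_int cached_status[MAX_PORTS];

/* Intended to be called periodically from a timer: refresh the cache. */
void link_status_refresh(void)
{
    for (int port = 0; port < MAX_PORTS; port++)
        atomic_store(&cached_status[port], slow_query_link_status(port));
}

/* Called on the hot path in place of the expensive query: memory read only. */
int link_status_get(int port)
{
    assert(port >= 0 && port < MAX_PORTS);
    return atomic_load(&cached_status[port]);
}
```

After one call to link_status_refresh, any number of link_status_get calls are served from memory without touching the slow query again. The trade-off is that a reader can see a status that is up to one timer period old, so the period has to be chosen to match how quickly the application needs to notice a link change.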
