#定位系统性能瓶颈# perf

来源:互联网 发布:淘宝突然没有访客了 编辑:程序博客网 时间:2024/05/19 02:19
perf是一个基于Linux 2.6+的调优工具,在liunx性能测量抽象出一套适应于各种不同CPU硬件的通用测量方法,其数据来源于比较新的linux内核提供的 perf_event 接口

系统事件:
perf tool 支持一系列可计算的事件类型。该工具和底层内核接口可以监测来自不同来源的事件。
例如,一些事件是来源于纯粹的内核计数器,这些event在这种情况下被称为软件事件。例子包括:context-switch、minor-fault等。


事件的另一个来源是处理器本身及其性能监控装置(PMU)。它提供了一系列事件来监测 micro-architectural 事件,如cycle的数量、指令退出、L1缓存不命中等等。

这些事件被称为PMU硬件事件或硬件事件。每种处理器型号都有各种不同的硬件事件类型。

perf_events接口还提供了一组通用的硬件事件名称。
在每个处理器,这些事件映射到实际的CPU提供的实际存在的事件类型,如果不存在则不能使用。
而且这些也被称为硬件事件(hardware events)和硬件缓存事件(hardware cache events)。

最后,还有各种 tracepoint事件是由内核ftrace框架实现的,而且这些事件都是2.6.3x或更加新的内核才支持。

Tracepoints:
Tracepoint 是散落在内核源代码中的一些 hook,一旦使能,它们便可以在特定的代码被运行到时被触发,这一特性可以被各种 trace/debug 工具所使用。Perf 就是该特性的用户之一。
假如您想知道在应用程序运行期间,内核内存管理模块的行为,便可以利用潜伏在 slab 分配器中的 tracepoint。当内核运行到这些 tracepoint 时,便会通知 perf。
Perf 将 tracepoint 产生的事件记录下来,生成报告,通过分析这些报告,调优人员便可以了解程序运行时期内核的种种细节,对性能症状作出更准确的诊断。

常用perf工具:
perf list:列出所有事件类型
perf top:类似top的全局监控工具
perf stat:对某个程序或者特定进程进行监控

perf record & perf report:先记录再通过详细报告的形式打印事件信息


  • perf list

[root@localhost test]# perf list

List of pre-defined events (to be used in -e):  cpu-cycles OR cycles                               [Hardware event]  instructions                                       [Hardware event]  cache-references                                   [Hardware event]  cache-misses                                       [Hardware event]  branch-instructions OR branches                    [Hardware event]  branch-misses                                      [Hardware event]  bus-cycles                                         [Hardware event]  stalled-cycles-frontend OR idle-cycles-frontend    [Hardware event]  stalled-cycles-backend OR idle-cycles-backend      [Hardware event]  ref-cycles                                         [Hardware event]  cpu-clock                                          [Software event]  task-clock                                         [Software event]  page-faults OR faults                              [Software event]  context-switches OR cs                             [Software event]  cpu-migrations OR migrations                       [Software event]  minor-faults                                       [Software event]  major-faults                                       [Software event]  alignment-faults                                   [Software event]  emulation-faults                                   [Software event]  L1-dcache-loads                                    [Hardware cache event]  L1-dcache-load-misses                              [Hardware cache event]  L1-dcache-stores                                   [Hardware cache event]  L1-dcache-store-misses                             [Hardware cache event]  L1-dcache-prefetches                               [Hardware cache event]  L1-dcache-prefetch-misses                          [Hardware cache event]  L1-icache-loads                                    [Hardware cache event]  L1-icache-load-misses                              [Hardware cache event]  L1-icache-prefetches                               [Hardware cache event]  L1-icache-prefetch-misses                          [Hardware cache event]  LLC-loads                                          [Hardware cache event]  LLC-load-misses                                    [Hardware cache event]  LLC-stores                                         [Hardware cache event]  LLC-store-misses                                   [Hardware cache event]  LLC-prefetches                                     [Hardware cache event]  LLC-prefetch-misses                                [Hardware cache event]  dTLB-loads                                         [Hardware cache event]  dTLB-load-misses                                   [Hardware cache event]  dTLB-stores                                        [Hardware cache event]  dTLB-store-misses                                  [Hardware cache event]  dTLB-prefetches                                    [Hardware cache event]  dTLB-prefetch-misses                               [Hardware cache event]  iTLB-loads                                         [Hardware cache event]  iTLB-load-misses                                   [Hardware cache event]  branch-loads                                       [Hardware cache event]  branch-load-misses                                 [Hardware cache event]  rNNN                                               [Raw hardware event descriptor]  cpu/t1=v1[,t2=v2,t3 ...]/modifier                  [Raw hardware event descriptor]    sunrpc:rpc_call_status                             [Tracepoint event]  sunrpc:rpc_bind_status                             [Tracepoint event]  sunrpc:rpc_connect_status                          [Tracepoint event]  sunrpc:rpc_task_begin                              [Tracepoint event]  sunrpc:rpc_task_run_action                         [Tracepoint event]  sunrpc:rpc_task_complete                           [Tracepoint event]  sunrpc:rpc_task_sleep                              [Tracepoint event]  sunrpc:rpc_task_wakeup                             [Tracepoint event]  ext4:ext4_free_inode                               [Tracepoint event]  ext4:ext4_request_inode                            [Tracepoint event]  ext4:ext4_allocate_inode                           [Tracepoint event]  ext4:ext4_write_begin                              [Tracepoint event]  ext4:ext4_da_write_begin                           [Tracepoint event]省略............

其中通过perf list命令运行在不同的系统会列出不同的结果,在 2.6.35 版本的内核中,该列表已经相当的长,但无论有多少,我们可以将它们划分为三类:

Hardware Event 是由 PMU 硬件产生的事件,比如 cache 命中,当您需要了解程序对硬件特性的使用情况时,便需要对这些事件进行采样;
Software Event 是内核软件产生的事件,比如进程切换,tick 数等 ;
Tracepoint event 是内核中的静态 tracepoint 所触发的事件,这些 tracepoint 用来判断程序运行期间内核的行为细节,比如 slab 分配器的分配次数等。


在操作系统运行过程中,关于系统调用的调度优先级别,从高到低是这样的:
硬中断->软中断->实时进程->内核进程->用户进程

Hard Interrupts – Devices tell the kernel that they are done processing. For example, a NIC delivers a packet or a hard drive provides an IO request(硬中断)
Soft Interrupts – These are kernel software interrupts that have to do with maintenance of the kernel. For example, the kernel clock tick thread is a soft interrupt. It checks to make sure a process has not passed its allotted time on a processor.(软中断)
Real Time Processes – Real time processes have more priority than the kernel itself. A real time process may come on the CPU and preempt (or “kick off”) the kernel. The Linux 2.4 kernel is NOT a fully preemptable kernel, making it not ideal for real time application programming.(实时进程,会抢占其他非实时进程时间片)
Kernel (System) Processes – All kernel processing is handled at this level of priority. (内核进程)
User Processes – This space is often referred to as “userland”. All software applications run in the user space. This space has the lowest priority in the kernel scheduling mechanism.(用户进程)


  • perf top

Samples: 475  of event 'cycles', Event count (approx.): 165419249 10.30%  [kernel]          [k] kallsyms_expand_symbol  7.75%  perf              [.] 0x000000000005f190  5.68%  libc-2.12.so      [.] memcpy  5.45%  [kernel]          [k] format_decode  5.45%  libc-2.12.so      [.] __strcmp_sse42  5.45%  perf              [.] symbols__insert  5.24%  [kernel]          [k] memcpy  4.86%  [kernel]          [k] vsnprintf  4.62%  [kernel]          [k] string  4.45%  [kernel]          [k] number  3.35%  libc-2.12.so      [.] __strstr_sse42  3.15%  perf              [.] hex2u64  2.72%  perf              [.] rb_next  2.10%  [kernel]          [k] pointer  1.89%  [kernel]          [k] security_real_capable_noaudit  1.89%  [kernel]          [k] strnlen  1.88%  perf              [.] rb_insert_color  1.68%  libc-2.12.so      [.] _int_malloc  1.68%  libc-2.12.so      [.] _IO_getdelim  1.28%  [kernel]          [k] update_iter  1.05%  [kernel]          [k] s_show  1.05%  libc-2.12.so      [.] memchr  1.05%  [kernel]          [k] get_task_cred  0.88%  [kernel]          [k] seq_read  0.85%  [kernel]          [k] clear_page_c  0.84%  perf              [.] symbol__new  0.63%  libc-2.12.so      [.] __libc_calloc  0.63%  [kernel]          [k] copy_user_generic_string  0.63%  [kernel]          [k] seq_vprintf  0.63%  perf              [.] kallsyms__parse  0.63%  libc-2.12.so      [.] _IO_feof  0.63%  [kernel]          [k] strcmp  0.63%  [kernel]          [k] page_fault  0.63%  perf              [.] dso__load_sym  0.42%  perf              [.] strstr@plt  0.42%  libc-2.12.so      [.] __strchr_sse42  0.42%  [kernel]          [k] seq_printf  0.42%  libc-2.12.so      [.] __memset_sse2  0.42%  libelf-0.152.so   [.] gelf_getsym  0.25%  [kernel]          [k] __mutex_init  0.21%  [kernel]          [k] _spin_lock  0.21%  [kernel]          [k] s_next  0.21%  [kernel]          [k] fsnotify  0.21%  [kernel]          [k] sys_read  0.21%  [kernel]          [k] __link_path_walk  0.21%  [kernel]          [k] intel_pmu_disable_all  0.21%  [kernel]          [k] mem_cgroup_charge_common  0.21%  [kernel]          [k] update_shares  0.21%  [kernel]          [k] native_read_tsc  0.21%  [kernel]          [k] apic_timer_interrupt  0.21%  [kernel]          [k] auditsys  0.21%  [kernel]          [k] do_softirq  0.21%  [kernel]          [k] update_process_times  0.21%  [kernel]          [k] native_sched_clock  0.21%  [kernel]          [k] mutex_lock  0.21%  [kernel]          [k] module_get_kallsym
perf top 类似top的作用,可以整体查看系统的全部事件并且按数量排序。

  • perf stat

perf stat可以直接接上命令或者通过pid形式对进程进行性能统计,完成后会打印出这个过程的性能数据统计

[root@localhost test]# perf stat -B dd if=/dev/zero of=/dev/null count=1000000记录了1000000+0 的读入记录了1000000+0 的写出512000000字节(512 MB)已复制,0.790059 秒,648 MB/秒 Performance counter stats for 'dd if=/dev/zero of=/dev/null count=1000000':        791.176286 task-clock                #    1.000 CPUs utilized                           1 context-switches          #    0.001 K/sec                                   0 cpu-migrations            #    0.000 K/sec                                 247 page-faults               #    0.312 K/sec                       1,248,519,452 cycles                    #    1.578 GHz                     [83.29%]       366,166,452 stalled-cycles-frontend   #   29.33% frontend cycles idle    [83.33%]       155,120,002 stalled-cycles-backend    #   12.42% backend  cycles idle    [66.62%]     1,947,919,494 instructions              #    1.56  insns per cycle                                                     #    0.19  stalled cycles per insn [83.31%]       355,465,524 branches                  #  449.287 M/sec                   [83.41%]         2,021,648 branch-misses             #    0.57% of all branches         [83.42%]       0.791116595 seconds time elapsed

通过pid指定某进程进行性能监测,按ctrl-C结束,然后计算这段时间内的性能统计

[root@localhost test]# perf stat -p 18669^C Performance counter stats for process id '18669':          1.520699 task-clock                #    0.001 CPUs utilized                          56 context-switches          #    0.037 M/sec                                   0 cpu-migrations            #    0.000 K/sec                                   0 page-faults               #    0.000 K/sec                           2,178,120 cycles                    #    1.432 GHz                     [63.18%]         1,410,393 stalled-cycles-frontend   #   64.75% frontend cycles idle    [90.94%]           942,665 stalled-cycles-backend    #   43.28% backend  cycles idle            1,067,824 instructions              #    0.49  insns per cycle                                                     #    1.32  stalled cycles per insn           193,104 branches                  #  126.984 M/sec                              14,544 branch-misses             #    7.53% of all branches         [61.93%]       2.061889979 seconds time elapsed

[root@localhost test]# perf stat usage: perf stat [<options>] [<command>]    -e, --event <event>   event selector. use 'perf list' to list available events        --filter <filter>                          event filter    -i, --no-inherit      child tasks do not inherit counters    -p, --pid <pid>       stat events on existing process id    -t, --tid <tid>       stat events on existing thread id    -a, --all-cpus        system-wide collection from all CPUs    -g, --group           put the counters into a counter group    -c, --scale           scale/normalize counters    -v, --verbose         be more verbose (show counter open errors, etc)    -r, --repeat <n>      repeat command and print average + stddev (max: 100)    -n, --null            null run - dont start any counters    -d, --detailed        detailed run - start a lot of events    -S, --sync            call sync() before starting a run    -B, --big-num         print large numbers with thousands' separators    -C, --cpu <cpu>       list of cpus to monitor in system-wide    -A, --no-aggr         disable CPU count aggregation    -x, --field-separator <separator>                          print counts with custom separator    -G, --cgroup <name>   monitor event in cgroup name only    -o, --output <file>   output file name        --append          append to the output file        --log-fd <n>      log output to fd, instead of stderr        --pre <command>   command to run prior to the measured command        --post <command>  command to run after to the measured command

perf stat 还有其他有用的选项如 -e 是匹配指定事件

perf stat -e cycles,instructions,cache-misses [...]
也可以指定监控的级别

perf stat -e cycles:u dd if=/dev/zero of=/dev/null count=100000
具体含义是:

ModifiersDescriptionExampleumonitor at priv level 3, 2, 1 (user)event:ukmonitor at priv level 0 (kernel)event:khmonitor hypervisor events on a virtualization environmentevent:hHmonitor host machine on a virtualization environmentevent:HGmonitor guest machine on a virtualization environmentevent:G

这些修饰符其实相当于flag标记的布尔类型,u是用户态,k是内核态,后面三个是和虚拟化相关的。


操作系统资源一般分为四大部分:CPU、内存、IO、网络

容易出现瓶颈的一般是CPU和IO,所以我这里写了两个小程序分别模拟这两种情况,然后通过使用perf一系列的工具找到症结所在。

程序1是计算素数,专门用来耗系统CPU的

#include <stdio.h>int IsPrime(int num){ int i=2; for(;i<=num/2;i++)  if(0==num%i)   return 0; return 1;}void main(){ int num; for(num=2;num<=10000000;num++)    IsPrime(num);    }

程序2是开n个线程,每隔1ms写1K的数据到文件,读1k的数据。用来耗系统IO

#include <unistd.h>#include <stdio.h>#include <string>#include <sys/types.h>#include <sys/wait.h>#include <sys/stat.h>#include <fcntl.h>#include <stdlib.h>#include <vector>using namespace std;bool gquit = false;void* threadfun(void* param){    string filename = *(string*)param;    unlink(filename.c_str());        int fd = open(filename.c_str(), O_CREAT | O_RDWR, 666);    if (fd < 0) return 0;    string a(1024*16*1024, 'a');    bool flag = true;    while (!gquit)    {        if (flag)        {            write(fd, a.c_str(), a.length());        }        else        {            read(fd, &a[0], a.length());                lseek(fd, 0, SEEK_SET);        }        flag = !flag;        usleep(1);    }    close(fd);    return 0;}int main(int argc, char* argv[]){    if (argc != 2)    {        printf("need thread num\n");        return 0;    }            int threadnum = atoi(argv[1]);    vector<string> v(threadnum, "./testio.filex_");    vector<pthread_t> vpid(threadnum, 0);    for(int i = 0 ; i < threadnum ; ++ i)    {        pthread_attr_t attr;        pthread_attr_init(&attr);        static char buf[32];        snprintf(buf, sizeof(buf), "%d", i);        v[i] += buf;        pthread_create(&vpid[i], & attr, threadfun, (void*)&v[i]);             }    printf("press q to quit\n");            while(getchar() != 'q');    gquit = true;    for(int i = 0 ; i < threadnum ; ++ i)    {        void* ret = NULL;        pthread_join(vpid[i], &ret);            unlink(v[i].c_str());    }       return 0;}

我先运行程序1后,通过perf top进行全局监控

Samples: 1K of event 'cycles', Event count (approx.): 690036067 74.14%  cpu               [.] IsPrime  2.45%  [kernel]          [k] kallsyms_expand_symbol  2.21%  perf              [.] 0x000000000005f225  1.65%  perf              [.] symbols__insert  1.50%  perf              [.] hex2u64  1.41%  libc-2.12.so      [.] __strcmp_sse42  1.26%  [kernel]          [k] string  1.25%  [kernel]          [k] format_decode  1.10%  [kernel]          [k] vsnprintf  0.95%  [kernel]          [k] strnlen  0.85%  [kernel]          [k] memcpy  0.80%  libc-2.12.so      [.] memcpy  0.70%  [kernel]          [k] number  0.60%  perf              [.] rb_next  0.60%  libc-2.12.so      [.] _int_malloc  0.55%  libc-2.12.so      [.] __strstr_sse42  0.55%  [kernel]          [k] pointer  0.55%  perf              [.] rb_insert_color

可以看出排名第一是cpu这个进程的IsPrime函数占用了最多的CPU性能

然后通过perf stat 指定pid进行监控

[root@localhost test]# perf stat -p 10704^C Performance counter stats for process id '10704':       4067.488066 task-clock                #    1.001 CPUs utilized           [100.00%]                 5 context-switches          #    0.001 K/sec                   [100.00%]                 0 cpu-migrations            #    0.000 K/sec                   [100.00%]                 0 page-faults               #    0.000 K/sec                      10,743,902,971 cycles                    #    2.641 GHz                     [83.32%]     5,863,649,174 stalled-cycles-frontend   #   54.58% frontend cycles idle    [83.32%]     1,347,218,352 stalled-cycles-backend    #   12.54% backend  cycles idle    [66.69%]    14,625,550,503 instructions              #    1.36  insns per cycle                                                     #    0.40  stalled cycles per insn [83.34%]     1,950,981,334 branches                  #  479.653 M/sec                   [83.35%]            46,273 branch-misses             #    0.00% of all branches         [83.32%]       4.064800952 seconds time elapsed

可以看到各种系统事件的统计信息

然后使用perf record和report进行更加具体的分析

[root@localhost test]# perf record -g -- ./cpu    ^C[ perf record: Woken up 4 times to write data ][ perf record: Captured and wrote 0.875 MB perf.data (~38244 samples) ]./cpu: Interrupt[root@localhost test]# perf reportSamples: 11K of event 'cycles', Event count (approx.):6815943846+ 99.70% cpu cpu [.] IsPrime+ 0.05% cpu [kernel.kallsyms] [k] hrtimer_interrupt+ 0.03% cpu cpu [.] main+ 0.02% cpu [kernel.kallsyms] [k] native_write_msr_safe+ 0.02% cpu [kernel.kallsyms] [k] _spin_unlock_irqrestore+ 0.02% cpu [kernel.kallsyms] [k] perf_adjust_freq_unthr_context+ 0.01% cpu [kernel.kallsyms] [k] rb_insert_color+ 0.01% cpu [kernel.kallsyms] [k] raise_softirq+ 0.01% cpu [kernel.kallsyms] [k] rcu_process_gp_end+ 0.01% cpu [kernel.kallsyms] [k] scheduler_tick+ 0.01% cpu [kernel.kallsyms] [k] do_timer+ 0.01% cpu [kernel.kallsyms] [k] _spin_lock_irqsave+ 0.01% cpu [kernel.kallsyms] [k] apic_timer_interrupt+ 0.01% cpu [kernel.kallsyms] [k] tick_sched_timer+ 0.01% cpu [kernel.kallsyms] [k] _spin_lock+ 0.01% cpu [kernel.kallsyms] [k] __remove_hrtimer+ 0.01% cpu [kernel.kallsyms] [k] idle_cpu

非常清晰可以看到各种函数的调用情况,由于这个程序是个很简单的代码,所以容易看出问题所在,但是如果是大项目的某个功能出现问题,通过perf工具就可以很方便地定位问题,后面再看看一个占用IO的例子。

同样也是先启动程序2,先通过perf top、perf stat -p pid这两个命令先定位是哪个进程导致性能问题

Samples: 394K of event 'cycles', Event count (approx.): 188517759438 75.39%  [kernel]             [k] _spin_lock 11.39%  [kernel]             [k] copy_user_generic_string  1.89%  [jbd2]               [k] jbd2_journal_stop  1.66%  [kernel]             [k] find_get_page  1.51%  [jbd2]               [k] start_this_handle  0.95%  [ext4]               [k] ext4_da_write_end  0.76%  [kernel]             [k] iov_iter_fault_in_readable  0.74%  [ext4]               [k] ext4_journal_start_sb  0.56%  [jbd2]               [k] __jbd2_log_space_left  0.44%  [ext4]               [k] __ext4_journal_stop  0.44%  [kernel]             [k] __block_prepare_write  0.41%  [kernel]             [k] __wake_up_bit  0.35%  [kernel]             [k] _cond_resched  0.32%  [kernel]             [k] kmem_cache_free

可以看到很多 spin lock的事件和各种文件操作事件


直接通过perf record和report进行分析,得出结果

[root@localhost test]# perf record -g -- ./io 10press q to quitq[ perf record: Woken up 83 times to write data ][ perf record: Captured and wrote 23.361 MB perf.data (~1020643 samples) ][root@localhost test]# perf reportSamples: 138K of event 'cycles', Event count (approx.): 64858537028+  65.87%  io  [kernel.kallsyms]    [k] _spin_lock+  14.19%  io  [kernel.kallsyms]    [k] copy_user_generic_string+   2.74%  io  [jbd2]               [k] jbd2_journal_stop+   2.46%  io  [kernel.kallsyms]    [k] find_get_page+   2.11%  io  [jbd2]               [k] start_this_handle+   1.06%  io  [kernel.kallsyms]    [k] iov_iter_fault_in_readable+   1.05%  io  [ext4]               [k] ext4_journal_start_sb+   0.85%  io  [ext4]               [k] ext4_da_write_end+   0.73%  io  [jbd2]               [k] __jbd2_log_space_left+   0.66%  io  [kernel.kallsyms]    [k] generic_write_end+   0.61%  io  [ext4]               [k] __ext4_journal_stop+   0.57%  io  [kernel.kallsyms]    [k] __block_prepare_write+   0.54%  io  [kernel.kallsyms]    [k] __wake_up_bit


好的,以上就是perf常用的工具,当然perf不止这几个工具,但这几个是最基础最常用的工具

通过查看官方wiki https://perf.wiki.kernel.org/index.php/Tutorial 可以了解更多信息





0 0
原创粉丝点击