Erlang Trace Tool Builder心得

来源：互联网发布：制影集的软件编辑：程序博客网时间：2024/05/17 05:07

原创翻译文章，转载请注明出处http://blog.csdn.net/erlib 作者Sunface

在学习的过程中顺便把Tracer翻译出来，现跟大家分享

1.Trace Tool Builder（erlang跟踪工具）主要特性：

用一个函数调用来跟踪多个节点上的文件端口
可以往在formatting过程中读取的记录跟踪信息的文件中写入在额外的信息
通过维护历史缓存和处理配置文件来复原以前版本的配置
对于连续的跟踪提供了一些简单的支持
格式化输出二进制格式跟踪日志和合并多个节点上的日志

2.ttb 模块

ttb模块是Trace Tool 的使用接口，首先你要启动跟踪：ttb:tracer/0/1/2，对需要跟踪的进程设置跟踪标志：ttb:p/2，然后当跟踪结束时你必须要停止跟踪器：ttb:stop/0/1，

然后格式化跟踪日志(当然必须有东西让你格式化):ttb:format/1/2。

ttb:trace/0/1/2在你需要跟踪的每个节点上启动一个跟踪端口，默认的情况下，跟踪信息被写入到远程节点上的二进制文件里面（the binary trace log）。

ttb:p/2 指定了要跟踪哪个进程，这个方法给与的跟踪标志（Trace flag）指定了在各个进程上要跟踪的内容。你可以调用这个函数多次来跟踪多个不同的进程。

如果你想要跟踪函数调用（换句话说在进程上设置了call跟踪标志），你必须设置跟踪模式（Pattern）在需要被跟踪的函数上，使用ttb:tp 或者ttb:tpl，只有在跟踪模式设置后函数才会被跟踪。跟踪模式指定了如何用

特殊的匹配格式进行跟踪，匹配格式在User's Guide for the erlang runtime system erts.

ttb:stop/0/1 停止在所有节点上的跟踪，删除所有的跟踪格式和刷新跟踪端口缓存。

ttb:format/1/2 将二进制格式的日志翻译成可读性较强的格式。默认的情况下显示的跟踪信息是一行文本，但是你也可以写自己的处理方法使得跟踪信息显示的更复杂详细。

跟踪日志甚至可以通过Event Tracer application以图形化的方式呈现出来。注意如果在ttb:stop/1中给与了format选项，那格式化会在ttb停止的时候自动处理。

3.实例1：通过shell跟踪本地节点

   -module(m).
-export([f/0]).
   f() ->
      receive
           From when is_pid(From) ->
             Now = erlang:now(),
               From ! {self(),Now}
    end.

下面会展示在erlang shell中ttb的基本使用方法，默认的选项是使用启动跟踪和格式化。这个会在最新创建的目录下创建一个日志文件名字为Node-ttb,Node是你的节点名，

(tiger@durin)47> %% First I spawn a process running my test function
(tiger@durin)47> Pid = spawn(m,f,[]).
<0.125.0>
(tiger@durin)48>
(tiger@durin)48> %% Then I start a tracer...
(tiger@durin)48> ttb:tracer().
{ok,[tiger@durin]}
(tiger@durin)49>
(tiger@durin)49> %% and activate the new process for tracing
(tiger@durin)49> %% function calls and sent messages.
(tiger@durin)49> ttb:p(Pid,[call,send]).
{ok,[{<0.125.0>,[{matched,tiger@durin,1}]}]}
(tiger@durin)50>
(tiger@durin)50> %% Here I set a trace pattern on erlang:now/0
(tiger@durin)50> %% The trace pattern is a simple match spec
(tiger@durin)50> %% indicating that the return value should be
(tiger@durin)50> %% traced. Refer to the reference_manual for
(tiger@durin)50> %% the full list of match spec shortcuts
(tiger@durin)50> %% available.
(tiger@durin)51> ttb:tp(erlang,now,return).
{ok,[{matched,tiger@durin,1},{saved,1}]}
(tiger@durin)52>
(tiger@durin)52> %% I run my test (i.e. send a message to
(tiger@durin)52> %% my new process)
(tiger@durin)52> Pid ! self().
<0.72.0>
(tiger@durin)53>
(tiger@durin)53> %% And then I have to stop ttb in order to flush
(tiger@durin)53> %% the trace port buffer
(tiger@durin)53> ttb:stop([return, {fetch_dir, "fetch"}]).
{stopped, "fetch"}
(tiger@durin)54>
(tiger@durin)54> %% Finally I format my trace log
(tiger@durin)54> ttb:format("fetch").
({<0.125.0>,{m,f,0},tiger@durin}) call erlang:now()
({<0.125.0>,{m,f,0},tiger@durin}) returned from erlang:now/0 ->
{1031,133451,667611}
({<0.125.0>,{m,f,0},tiger@durin}) <0.72.0> !
{<0.125.0>,{1031,133451,667611}}
ok

4.实例2：创建你自己的工具

这个例子是个简单的工具用来"debug tracing",通过返回值来跟踪函数的调用

-module(mydebug).
-export([start/0,trc/1,stop/0,format/1]).
-export([print/4]).
%% Include ms_transform.hrl so that I can use dbg:fun2ms/2 to
%% generate match specifications.
-include_lib("stdlib/include/ms_transform.hrl").
%%% -------------Tool API-------------
%%% ----------------------------------
%%% Star the "mydebug" tool
start() ->
    %% The options specify that the binary log shall be named
    %% <Node>-debug_log and that the print/4 function in this
    %% module shall be used as format handler
    ttb:tracer(all,[{file,"debug_log"},{handler,{{?MODULE,print},0}}]),
    %% All processes (existing and new) shall trace function calls
    %% We want trace messages to be sorted upon format, which requires
    %% timestamp flag. The flag is however enabled by default in ttb.
    ttb:p(all,call).

%%% Set trace pattern on function(s)
trc(M) when is_atom(M) ->
    trc({M,'_','_'});
trc({M,F}) when is_atom(M), is_atom(F) ->
    trc({M,F,'_'});
trc({M,F,_A}=MFA) when is_atom(M), is_atom(F) ->
    %% This match spec shortcut specifies that return values shall
    %% be traced.
    MatchSpec = dbg:fun2ms(fun(_) -> return_trace() end),
    ttb:tpl(MFA,MatchSpec).

%%% Format a binary trace log
format(Dir) ->
    ttb:format(Dir).

%%% Stop the "mydebug" tool
stop() ->
    ttb:stop(return).

%%% --------Internal functions--------
%%% ----------------------------------
%%% Format handler
print(_Out,end_of_trace,_TI,N) ->
    N;
print(Out,Trace,_TI,N) ->
    do_print(Out,Trace,N),
    N+1.

do_print(Out,{trace_ts,P,call,{M,F,A},Ts},N) ->
    io:format(Out,
              "~w: ~w, ~w:~n"
              "Call      : ~w:~w/~w~n"
              "Arguments :~p~n~n",
              [N,Ts,P,M,F,length(A),A]);
do_print(Out,{trace_ts,P,return_from,{M,F,A},R,Ts},N) ->
    io:format(Out,
              "~w: ~w, ~w:~n"
              "Return from : ~w:~w/~w~n"
              "Return value :~p~n~n",
              [N,Ts,P,M,F,A,R]).

为了区别我们这个工具产生的日志跟其他日志，在tracer/2中使用了file选项，因此日志会放在一个ttb_upload_debug_log-YYYYMMDD-HHMMSS目录下,

在启动跟踪的时候使用handle选项后，关于如果格式化文件的信息就保存在跟踪信息文件.ti中，这不是必须的。如果你想在ttb:stop/1中设置format选项后自动格式化日志，那它就非常有用。这表明你不需要知道任何指示关于你的二进制日志的内容是如果按照它需要的方式被格式化的，如果handle选项在启动跟踪和格式化的时候都被使用，那只有格式化时使用的Handle有效果。

call跟踪标志在所有进程都有设置，因此在所有已存在和新的进程上任何呗trc/1命令激活的函数都会被追踪

5.对远程节点上使用Trace Tool Builder

在需要被跟踪的节点上Observer application不是总能使用的（traced node），但是还是有可能从另外一个节点（trace control node）上启动Trace Tool,只要

Observer application 在trace control node上可以使用
在trace control node 和traced node上都可以启动Runtime Tools application

如果Trace Tool Builder要对远程节点使用，那么强烈推荐启动trace control node的时候使用hidden,这样就能连接到traced node而不会呗traced node发现，换句话说，在traced node上会调用 nodes() BIF,但是trace control node就不会标示出来，要启动hidden node，在启动erl时候加上-hidden选项

% erl -sname trace_control -hidden

Diskless node
If the traced node is diskless, ttb must be started from        a trace control node with disk access, and the file option        must be given to the tracer/2 function with the value         {local, File}, e.g.

无盘节点

如果traced node是无盘节点，那么ttb 必须从trace control node上以disk access的方式启动，tracer/2必须使用file 选项value是{local,File}

(trace_control@durin)1> ttb:tracer(mynode@diskless,{file,{local,
{wrap,"mytrace"}}}).
{ok,[mynode@diskless]}

6.额外的跟踪选项

当建立一个跟踪的时候，可能会有以下特性需求

时间限制的跟踪
过载防护
自动恢复

时间限制跟踪

有的时候需要在给定的时间内运行跟踪程序，可以通过额外选项{timer,TimerSpec｝来设置，在TimerSpec的毫秒时间结束后，就会调用ttb:stop/0,ttp:p/2,所以跟踪pattern应该在建立跟踪前设置，ttb:start_trace/4在激活ttb:p/2之前总会建立所有的pattern,注意由于网络和进程延迟，跟踪时间段不是精确的。下面的例子说明了如果建立一个跟踪在5秒后会自动停止和格式化

(tiger@durin)1>ttb:start_trace([node()],
                               [{erlang, now,[]}],
                               {all, call},
                               [{timer, {5000, format}}]).

过载保护

When tracing live systems, special care needs to be always taken not to overload a node with too heavy tracing.ttb provides theoverload option to help to address the problem.

{overload, MSec, Module, Function} instructs the ttb backend (calledobserver_backend, part of theruntime_tools application) to perform overload check everyMSec milliseconds. If the check (namelyModule:Function(check)) returns true, tracing is disabled on the selected node.

Overload protection activated on one node does not affect other nodes, where the tracing continues as normal. ttb:stop/0/1 fetches data from all clients, including everything that has been collected before overload protection was activated. Note that changing trace details (withttb:p andttb:tp/tpl...) once overload protection gets activated in one of the traced nodes is not permitted in order not to allow trace setup to be inconsistent between nodes.

Module:Function provided with the overload option must handle three calls: init,check andstop.init andstop allows to perform some setup and teardown required by the check. An overload check module could look like this (note that check is always called by the same process, soput and get are possible).

当跟踪当前的系统的时候，要一直特别关照不要在一个节点上过载过多的trace，ttb提供了overload选项来帮助处理这些问题，｛overload,Msec,Module,Function}指示ttb后端(调用observer_backend,part of the runtime_tools application)执行过载检查，每过Msec检查一次。如果check(namely Module:Funciton(check))返回真，则跟踪就在选中的节点上停止。

过载保护在一个节点激活不会影响其他节点，ttb:stop/1从所有客户端获取数据包括所有的在过载保护激活之前收集的数据。注意一旦过载保护在traced nodes上被激活，更改跟踪细节（ttb:p和ttb:tp/tpl）是不被允许的，是为了保持节点之间的trace setup的一致性。

使用overload的Module:Function必须处理3个调用：init,check,stop。init和stop允许根据check的需求执行一些安装和关闭。过载check模块应该是这样的（注意check总是被同一进程调用，所以put和get是被允许的）。

-module(overload).
-export([check/1]).

check(init) ->
    Pid = sophisticated_module:start(),
    put(pid, Pid);
check(check) ->
    get(pid) ! is_overloaded,
    receive
        Reply ->
            Reply
    after 5000 ->
            true
    end;
check(stop) ->
    get(pid) ! stop.

Autoresume

It is possible that a node (probably a buggy one, hence traced) crashes. In order to automatically resume tracing on the node as soon as it gets back,resume has to be used. When it is, the failing node tries to reconnect to trace control node as soon asruntime tools is started. This implies thatruntime_tools must be included in other node's startup chain (if it is not, one could still resume tracing by startingruntime_tools manually, i.e. by an RPC call).

In order not to loose the data that the failing node stored up to the point of crash, the control node will try to fetch it before restarting trace. This must happen within the allowed time frame or is aborted (default is 10 seconds, can be customized with {resume, MSec}). The data fetched this way is then merged with all other traces.

Autostart feature requires additional data to be stored on traced nodes. By default, the data is stored automatically to the file called "ttb_autostart.bin" in the traced node's cwd. Users may decide to change this behaviour (i.e. on diskless nodes) by specifying their own module to handle autostart data storage and retrieval (ttb_autostart_module environment variable ofruntime_tools). Please see the ttb's reference manual to see the module's API. This example shows the default handler

自动恢复

为了让崩溃的节点在重启后自动恢复跟踪，就要使用resume.崩溃节点尝试重新连接到trace control node一旦runtime_tools启动，这个说明runtme_tools必须被包含进其他节点的启动链中（如果没有包含进启动链，也可以恢复追踪通过手动启动runtime_tools,比如RPC CALL）为了让节点在崩溃时保存的数据不丢失，control node 将尝试在重启跟踪之前获取它，这个必须在允许的时间内或者终止(默认10秒，也可以通过{resume,MSec}设置)，通过这种方式获取的数据就会被合并至其他所有的跟踪内。

自动启动特性需要额外的数据保存在跟踪节点内，默认的，数据是自动储存至文件“ttb_autostart.bin"在traced nodes的cwd里面，用户可能会决定处理这种行为（比如在diskless节点上）通过指定自己的模块来处理自动重启的数据存储和取回（ttb_auostart_module evironment variable of runtime_tools），请看ttb‘s的文档查阅模块API，这个例子使用默认的handler

-module(ttb_autostart).
-export([read_config/0,
write_config/1,
         delete_config/0]).

-define(AUTOSTART_FILENAME, "ttb_autostart.bin").

delete_config() ->
    file:delete(?AUTOSTART_FILENAME).

read_config() ->
    case file:read_file(?AUTOSTART_FILENAME) of
        {ok, Data} -> {ok, binary_to_term(Data)};
        Error      -> Error
    end.

write_config(Data) ->
    file:write_file(?AUTOSTART_FILENAME, term_to_binary(Data)).

Remember that file trace ports buffer the data by default. If the node crashes, trace messages are not flushed to the binary log. If the chance of failure is high, it might be a good idea to automatically flush the buffers every now and then. Passing {flush, MSec} as one ofttb:tracer/2 option flushes all buffers everyMSec milliseconds.

记住文件跟踪端口进行缓存数据的方式是默认的，如果节点崩溃，跟踪信息是不会被记录进二进制日志中的，如果崩溃的频率较高，那么自动写入缓存是一个好注意，用

｛flush,Msec}作为ttb:tracer/2的选项来写入缓存

7.Formatting

格式化会在自动结束ttb模块或者调用tt:format/1/2函数时调用。

格式化意味着读取一个二进制日志，然后用可读性良好的格式展示出来。你可以用默认方式处理也可以自己写处理handler，甚至可以用Event Tracer et 来图形化显示出来。

ttb:format/1/2的第一个参数表明了选择那个二进制日志格式化，这通常是一个目录名在获取日志的时候ttb创建的，除非有disable_sort选项，不同文件中的日志都是通过跟踪的时间戳来排序的。

第二参数是列表选项，out选项表明格式化文本要被写入的地方，默认的地方是standard_io,但是也能用自己的文件名。handler选项表明了格式化handler的使用，如果没有这个选项，handle选项在启动跟踪的时候会被使用，如果在启动追踪的时候也没使用，那么默认的handler就会使用，它会以一行的本文打印出跟踪信息，disable_sort选项表明日志不根据时间戳来排序，而且根据进程产生的文件顺序（这个速度会快点）

A format handler is a fun taking four arguments. This fun will be called for each trace message in the binary log(s). A simple example which only prints each trace message could be like this:

format handler是个有4个参数的fun函数，对于二进制日志中的每条追踪信息fun函数都将被调用。

fun(Fd, Trace, _TraceInfo, State) ->
   io:format(Fd, "Trace: ~p~n", [Trace]),
   State
end.

Fd是目标文件的描述符或者原子standard_io。_TraceInfo包含来自跟踪信息文件中的信息，State是format handler fun的状态变量，State的初始化值是handle选项给与的

ttb:format("tiger@durin-ttb", [{handler, {{Mod,Fun}, initial_state}}])

^^^^^^^^^^^^^

Another format handler could be used to calculate time spent by the garbage collector:

另外一个format handler可以用来计算垃圾收集的花费时间

fun(_Fd,{trace_ts,P,gc_start,_Info,StartTs},_TraceInfo,State) ->
          [{P,StartTs}|State];
   (Fd,{trace_ts,P,gc_end,_Info,EndTs},_TraceInfo,State) ->
         {value,{P,StartTs}} = lists:keysearch(P,1,State),
         Time = diff(StartTs,EndTs),
         io:format("GC in process ~w: ~w milliseconds~n", [P,Time]),
         State -- [{P,StartTs}]
end

Start tracing:

(tiger@durin)1> ttb:tracer(node(),{file,{wrap,"trace"}}).
{ok,[tiger@durin]}
(tiger@durin)2> ttb:p(...)
...

This will give a set of binary logs, like:

tiger@durin-trace.0.wrp
tiger@durin-trace.1.wrp
tiger@durin-trace.2.wrp
...

Format the whole set of logs:

1> ttb:format("tiger@durin-trace.*.wrp").
....
ok
2>

Format only the first log:

1> ttb:format("tiger@durin-trace.0.wrp").
....
ok
2>

To merge all wrap logs from two nodes:

1> ttb:format(["tiger@durin-trace.*.wrp","lion@durin-trace.*.wrp"]).
....
ok
2>

8.用Event Tracer显示跟踪日志

-module(foo).
-export([start/0,go/0]).

start() ->
    spawn(?MODULE, go, []).

go() ->
    receive
        stop ->
            ok;
        go ->
            bar:f1(),
            go()
    end.

-module(bar).
-export([f1/0,f3/0]).
f1() ->
    f2(),
    ok.
f2() ->
    spawn(?MODULE,f3,[]).
f3() ->
    ok.

Now let's set up the trace.

(tiger@durin)1>%%First we retrieve the Pid to limit traced processes set
(tiger@durin)1>Pid = foo:start().
(tiger@durin)2>%%Now we set up tracing
(tiger@durin)2>ttb:tracer().
(tiger@durin)3>ttb:p(Pid, [call, return_to, procs, set_on_spawn]).
(tiger@durin)4>ttb:tpl(bar, []).
(tiger@durin)5>%%Invoke our test function and see output with et viewer
(tiger@durin)5>Pid ! go.
(tiger@durin)6>ttb:stop({format, {handler, ttb:get_et_handler()}}).

This shoud render a result similar to the following:

IMAGE MISSING
Figure 2.1:   Filter: "processes" IMAGE MISSING
Figure 2.2:   Filter: "mods_and_procs"
Note, that we can use ttb:start_trace/4 function to help us here:

(tiger@durin)1>Pid = foo:start().
(tiger@durin)2>ttb:start_trace([node()],
                               [{bar,[]}],
                               {Pid, [call, return_to, procs, set_on_spawn]}
                               {handler, ttb:get_et_handler()}).
(tiger@durin)3>Pid ! go.
(tiger@durin)4>ttb:stop(format).

9.从所有节点中自动收集和格式化日志

默认的通过ttb :stop/1 来从所有节点上获取跟踪日志和跟踪信息，日志存储在新的目录下ttb_upload-Filename-Timestamp,目录在trace control node的工作目录中，可以通过ttb:stop/1 nofecth选项来禁止搜集日志，如果format选项在ttb:stop/1中使用，在跟踪停止后，跟踪日志会自动格式化

10.连续跟踪

连续跟踪的支持来自Trace Tool Builder

开始系统跟踪，当跟踪端口开始ttb:tracer/0/1/2，开始系统追踪会自动完成
当激活连续跟踪的时候创建匹配格式

开始连续跟踪要求跟踪器启动ttb:tracer/0/1/2函数，连续跟踪可以用trigger函数跟用ttb:seq_trigger_ms/0/1创建的匹配格式启动，或者直接用kernel application中的seq_trace模块启动

连续跟踪例子：

下面的例子，连续跟踪的trigger是dbg:get_tracer/0:
(tiger@durin)110> ttb:tracer().
{ok,[tiger@durin]}
(tiger@durin)111> ttb:p(self(),call).
{ok,{[<0.158.0>],[call]}}
(tiger@durin)112> ttb:tp(dbg,get_tracer,0,ttb:seq_trigger_ms(send)).
{ok,[{matched,1},{saved,1}]}
(tiger@durin)113> dbg:get_tracer(), seq_trace:reset_trace().
true
(tiger@durin)114> ttb:stop(format).
({<0.158.0>,{shell,evaluator,3},tiger@durin}) call dbg:get_tracer()
SeqTrace [0]: ({<0.158.0>,{shell,evaluator,3},tiger@durin})
{<0.237.0>,dbg,tiger@durin} ! {<0.158.0>,{get_tracer,tiger@durin}}
[Serial: {0,1}]
SeqTrace [0]: ({<0.237.0>,dbg,tiger@durin})
{<0.158.0>,{shell,evaluator,3},tiger@durin} ! {dbg,{ok,#Port<0.222>}}
[Serial: {1,2}]
ok
(tiger@durin)116>

用trigger启动连续跟踪的时候当trigger函数并不是从shell中直接调用的时候会更有用，但是在大型系统会更隐蔽。当从shell中调用一个函数的时候，直接启动连续跟踪会更简单

(tiger@durin)116> ttb:tracer().
{ok,[tiger@durin]}
(tiger@durin)117> seq_trace:set_token(send,true), dbg:get_tracer(),
seq_trace:reset_trace().
true
(tiger@durin)118> ttb:stop(format).
SeqTrace [0]: ({<0.158.0>,{shell,evaluator,3},tiger@durin})
{<0.246.0>,dbg,tiger@durin} ! {<0.158.0>,{get_tracer,tiger@durin}}
[Serial: {0,1}]
SeqTrace [0]: ({<0.246.0>,dbg,tiger@durin})
{<0.158.0>,{shell,evaluator,3},tiger@durin} ! {dbg,{ok,#Port<0.229>}}
[Serial: {1,2}]
ok
(tiger@durin)120>

All functions in the seq_trace module, except set_system_tracer/1, can be used after the trace port has been started with ttb:tracer/0/1/2.

在上面两个例子中，在跟踪函数后seq_trace:reset_trace/0 会立即复位trace token 是为了避免大量的trace messages 因为erlang shell会将他们打印输出

在seq_trace模块中的所有函数，除了set_system_tracer/1,可以在跟踪端口被ttb:tracer/1/2启动后使用

11.实例：Multipurpose trace tool

模块multitrace.erl可以在src目录下找到，跟踪信息可以用格式化的方法multitrace:format/1/2写入二进制文件

multitrace:debug(What)

启动跟踪调用在所有进程上，跟踪给定的函数。format handler 是multirtace:handle_debug/4 ，会打印出每个调用和返回。 what必须是一个要跟踪的项目或者项目列表，

使用格式{Module,Function,Arity},{Module,Function} or just Module.

multitrace:gc(Procs)

跟踪指定进程上的垃圾回收，format handler 是multitrace:handle_schedule/4 ，打印出每个GC的开始时间，结束时间，花费时间

multitrace:schedule(Procs)

跟踪指定进程上的进出时间表，format handler是multitrace:handle_chedule/4，根据进程，时间戳和当前函数打印出每个进出时间表，它也打印每个跟踪进程且有时间表的进程的总时间