XGBoost: Binary Classification

Source: Internet   Editor: 程序博客网   Time: 2024/05/16 16:56

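This article describes how to use XGBoost from the command line. For Python and R usage, see https://github.com/dmlc/xgboost/blob/master/doc/README.md .
The sections below show how to solve a binary classification problem with XGBoost. The dataset used is the mushroom dataset.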

Introduction

Generating the input data

XGBoost takes input data in the same format as LibSVM. An example of the input format:

<code class="hljs r has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;">1 101:1.2 102:0.03
0 1:2.1 10001:300 10002:400
...</code>

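Each line represents one sample. The number in the first column is the class label, i.e. the class the sample belongs to. '101' and '102' are feature indices, and '1.2' and '0.03' are the corresponding feature values. In binary classification, '1' denotes the positive class and '0' the negative class. Probability labels are also supported: a label in the range [0,1] gives the probability that the sample belongs to the positive class.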
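To make the line layout concrete, here is a minimal sketch of a parser for one such line. The function name `parse_libsvm_line` is made up for illustration and is not part of XGBoost; the label is read as a float on the assumption that probability labels should parse as well.

```python
def parse_libsvm_line(line):
    """Split one 'label index:value ...' line into (label, {index: value})."""
    tokens = line.split()
    label = float(tokens[0])  # float, so probability labels in [0, 1] also parse
    features = {}
    for token in tokens[1:]:
        index, value = token.split(":")
        features[int(index)] = float(value)
    return label, features


label, features = parse_libsvm_line("1 101:1.2 102:0.03")
print(label, features)
```

Features that do not appear on a line are simply absent from the dictionary, which is what makes the format compact for sparse data.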

The first step is to convert the dataset into LibSVM format by running the following scripts:

<code class="hljs avrasm has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;">python mapfeat.py
python mknfold.py agaricus.txt 1</code>

mapfeat.py and mknfold.py are listed below, in that order:

<code class="hljs python has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;">#!/usr/bin/python

def loadfmap( fname ):
    # fmap: attribute position -> {value code: feature index}
    # nmap: feature index -> readable feature name
    fmap = {}
    nmap = {}

    for l in open( fname ):
        arr = l.split()
        if arr[0].find('.') != -1:
            # a first token containing '.' starts a new attribute
            idx = int( arr[0].strip('.') )
            assert idx not in fmap
            fmap[ idx ] = {}
            ftype = arr[1].strip(':')
            content = arr[2]
        else:
            # continuation line: more values of the current attribute
            content = arr[0]
        for it in content.split(','):
            if it.strip() == '':
                continue
            k , v = it.split('=')
            # each (attribute, value) pair gets its own feature index
            fmap[ idx ][ v ] = len(nmap)
            nmap[ len(nmap) ] = ftype+'='+k
    return fmap, nmap

def write_nmap( fo, nmap ):
    # one line per feature: index, name, type ('i')
    for i in range( len(nmap) ):
        fo.write('%d\t%s\ti\n' % (i, nmap[i]) )

# start here
fmap, nmap = loadfmap( 'agaricus-lepiota.fmap' )
fo = open( 'featmap.txt', 'w' )
write_nmap( fo, nmap )
fo.close()

fo = open( 'agaricus.txt', 'w' )
for l in open( 'agaricus-lepiota.data' ):
    arr = l.split(',')
    # label: 'p' becomes 1, 'e' becomes 0
    if arr[0] == 'p':
        fo.write('1')
    else:
        assert arr[0] == 'e'
        fo.write('0')
    # every attribute value becomes one active feature with value 1
    for i in range( 1,len(arr) ):
        fo.write( ' %d:1' % fmap[i][arr[i].strip()] )
    fo.write('\n')

fo.close()</code>
<code class="hljs livecodeserver has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;">#!/usr/bin/python
import sys
import random

# both &lt;filename&gt; and &lt;k&gt; are required, so at least 3 entries in argv
if len(sys.argv) &lt; 3:
    print('Usage:&lt;filename&gt; &lt;k&gt; [nfold = 5]')
    sys.exit(0)

# fixed seed so the split is reproducible
random.seed( 10 )

k = int( sys.argv[2] )
if len(sys.argv) &gt; 3:
    nfold = int( sys.argv[3] )
else:
    nfold = 5

fi = open( sys.argv[1], 'r' )
ftr = open( sys.argv[1]+'.train', 'w' )
fte = open( sys.argv[1]+'.test', 'w' )
for l in fi:
    # each line goes to the test set with probability 1/nfold
    if random.randint( 1 , nfold ) == k:
        fte.write( l )
    else:
        ftr.write( l )

fi.close()
ftr.close()
fte.close()</code>

Running the two Python scripts above produces the training set 'agaricus.txt.train' and the test set 'agaricus.txt.test'.
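The encoding step in mapfeat.py can be easier to follow on a toy example. The sketch below uses a hypothetical two-attribute feature map (the value codes and indices are invented for illustration, not taken from agaricus-lepiota.fmap) and applies the same rule as the script: one feature index per attribute value, written as `index:1`.

```python
# Hypothetical feature map: attribute position -> {value code: feature index}.
fmap = {
    1: {"b": 0, "c": 1},  # attribute 1 with two possible codes
    2: {"f": 2, "g": 3},  # attribute 2 with two possible codes
}


def encode_record(record, fmap):
    """Turn one comma-separated record into a LibSVM-style line."""
    arr = record.strip().split(",")
    label = "1" if arr[0] == "p" else "0"  # same label rule as mapfeat.py
    feats = [" %d:1" % fmap[i][arr[i]] for i in range(1, len(arr))]
    return label + "".join(feats)


print(encode_record("p,c,f", fmap))
print(encode_record("e,b,g", fmap))
```

Each record activates exactly one feature per attribute, so every output line has as many `index:1` entries as the record has attributes.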

Training

Run the following command to train the model:

<code class="hljs avrasm has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;">xgboost mushroom.conf</code>

The mushroom.conf file holds the settings needed for training and testing the model. Each line has the form [attribute]=[value]:

<code class="language-conf hljs vala has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;"># General Parameters, see comment for each definition
# can be gbtree or gblinear
booster = gbtree
# choose logistic regression loss function for binary classification
objective = binary:logistic

# Tree Booster Parameters
# step size shrinkage
eta = 1.0
# minimum loss reduction required to make a further partition
gamma = 1.0
# minimum sum of instance weight(hessian) needed in a child
min_child_weight = 1
# maximum depth of a tree
max_depth = 3

# Task Parameters
# the number of round to do boosting
num_round = 2
# 0 means do not save any model except the final round model
save_period = 0
# The path of training data
data = "agaricus.txt.train"
# The path of validation data, used to monitor training process, here [test] sets name of the validation set
eval[test] = "agaricus.txt.test"
# The path of test data
test:data = "agaricus.txt.test"</code>

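Here the booster is gbtree and the objective is logistic regression. In other words, training uses classic gradient boosted regression trees (GBRT), a method that handles binary classification well.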
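The [attribute]=[value] layout of mushroom.conf is simple enough to read with a few lines of Python. The following is only an illustrative sketch of the line format, not XGBoost's own parser; `parse_conf` is a hypothetical helper, and stripping the surrounding double quotes is an assumption about how quoted values should be treated.

```python
def parse_conf(text):
    """Parse 'name = value' lines, skipping blank lines and '#' comments."""
    params = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        name, value = line.split("=", 1)
        params[name.strip()] = value.strip().strip('"')
    return params


sample = """
# can be gbtree or gblinear
booster = gbtree
objective = binary:logistic
max_depth = 3
data = "agaricus.txt.train"
eval[test] = "agaricus.txt.test"
"""
params = parse_conf(sample)
print(params["booster"], params["max_depth"], params["data"])
```

All values are kept as strings here; converting `max_depth` to an integer or `eta` to a float would be a separate step.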

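The configuration file above lists the most commonly used parameters. For the full list, see https://github.com/dmlc/xgboost/blob/master/doc/parameter.md . If you prefer not to set algorithm parameters in the configuration file, you can set them on the command line, as follows: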

<code class="hljs fix has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;">xgboost mushroom.conf max_depth=6</code>

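This sets max_depth to 6 instead of the 3 given in the configuration file. When passing parameters on the command line, make sure max_depth=6 is a single argument, i.e. it contains no spaces. If a parameter is set both in the configuration file and on the command line, the command-line value overrides the one in the configuration file.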
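Both rules (no spaces inside an argument, command line wins) can be illustrated with a short sketch. `apply_overrides` is a hypothetical helper written for this example and does not reproduce XGBoost's internals; `shlex.split` from the Python standard library is used to mimic how a POSIX shell splits the command into arguments.

```python
import shlex


def apply_overrides(conf_params, cli_args):
    """Return a copy of conf_params with 'name=value' CLI arguments applied on top."""
    merged = dict(conf_params)
    for arg in cli_args:
        name, value = arg.split("=", 1)
        merged[name] = value  # the command-line value replaces the file value
    return merged


conf_params = {"booster": "gbtree", "max_depth": "3"}
merged = apply_overrides(conf_params, ["max_depth=6"])
print(merged)

# Without spaces the shell passes one argument; with spaces it passes three.
print(shlex.split("xgboost mushroom.conf max_depth=6")[2:])
print(shlex.split("xgboost mushroom.conf max_depth = 6")[2:])
```

With spaces around the equals sign, the program receives `max_depth`, `=` and `6` as three separate arguments, none of which is a complete name=value pair.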

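The example above uses the tree booster for gradient boosting. To use the linear booster instead, change the booster parameter to gblinear and replace the tree booster parameters with linear booster parameters; the remaining settings in the configuration file stay as they are. The configuration file then looks like this: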

<code class="language-conf hljs livecodeserver has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;"># General Parameters
# choose the linear booster
booster = gblinear
...

# Change Tree Booster Parameters into Linear Booster Parameters
# L2 regularization term on weights, default 0
lambda = 0.01
# L1 regularization term on weights, default 0
alpha = 0.01
# L2 regularization term on bias, default 0
lambda_bias = 0.01

# Regression Parameters
...</code>

Notes on the input data:

- If agaricus.txt.test.buffer exists, XGBoost automatically loads from this binary buffer when possible, which speeds up training when you train many times. You can disable this by setting use_buffer=0.
- A buffer file can also be used as standalone input: if the buffer file exists but the original agaricus.txt.test has been removed, XGBoost will still run.
- Deviation from the LibSVM input format: XGBoost is compatible with the LibSVM format, with the following minor differences:
  - XGBoost allows feature indices to start from 0.
  - For binary classification, the label is 1 for positive and 0 for negative, instead of +1 and -1.
  - The feature indices in each line do not need to be sorted.

Prediction

Once the model has been trained, you can run it on the test data with the following command:

<code class="hljs bash has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;">xgboost mushroom.conf task=pred model_in=0003.model</code>

For binary classification, each predicted value is a probability in [0,1]: the probability that the instance belongs to the positive class.
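Because the output is a probability rather than a class label, a thresholding step is usually needed before computing an error rate. The sketch below is only an illustration: it assumes the predictions are available as text with one probability per line, and the function name `probabilities_to_labels` and the 0.5 threshold are assumptions, not part of XGBoost.

```python
from typing import Iterable, List


def probabilities_to_labels(lines: Iterable[str], threshold: float = 0.5) -> List[int]:
    """Turn one-probability-per-line text into 0/1 class labels.

    A value strictly greater than `threshold` is labelled 1 (positive class).
    """
    labels = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        prob = float(line)
        if not 0.0 <= prob <= 1.0:
            raise ValueError(f"not a probability: {prob}")
        labels.append(1 if prob > threshold else 0)
    return labels


if __name__ == "__main__":
    # Hypothetical values for illustration only, not real XGBoost output.
    sample = ["0.9731", "0.0125", "0.5001"]
    print(probabilities_to_labels(sample))
```

A different threshold may be appropriate when the two kinds of misclassification have different costs.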

Dumping the model

This is still a preliminary feature, and only tree models can be dumped. XGBoost can write a tree model out as text with the following commands:

<code class="hljs avrasm has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;">../../xgboost mushroom.conf task=dump model_in=0003.model name_dump=dump.raw.txt
../../xgboost mushroom.conf task=dump model_in=0003.model fmap=featmap.txt name_dump=dump.nice.txt</code>

0003.model is dumped to dump.raw.txt and dump.nice.txt. The result in dump.nice.txt is easier to read because it uses the feature map file featmap.txt.

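Each line of featmap.txt has the format <featureid> <featurename> <q or i or int>\n, where:

  • Feature ids start at 0 and run up to the number of features, in ascending order.
  • i means the feature is a binary indicator feature.
  • q means the feature is a quantitative value, such as age or time; its value may be missing.
  • int means the feature is an integer value (when int is hinted, the decision boundary will be an integer).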
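The format above is simple enough to generate from a list of feature names. The following is a minimal sketch under stated assumptions: the helper name `featmap_lines`, the tab separator, and the rule that names contain no whitespace are choices made for this example, and the feature names are invented.

```python
from typing import Iterable, List, Tuple

VALID_TYPES = {"q", "i", "int"}


def featmap_lines(features: Iterable[Tuple[str, str]]) -> List[str]:
    """Build featmap.txt lines: <featureid> <featurename> <q or i or int>.

    Ids are assigned from 0 in the order given, so they are already sorted.
    """
    lines = []
    for feature_id, (name, ftype) in enumerate(features):
        if ftype not in VALID_TYPES:
            raise ValueError(f"unknown feature type: {ftype!r}")
        if not name or any(ch.isspace() for ch in name):
            # Fields are whitespace-separated, so a name with spaces would
            # be split into extra fields (assumption of this sketch).
            raise ValueError(f"invalid feature name: {name!r}")
        lines.append(f"{feature_id}\t{name}\t{ftype}")
    return lines


if __name__ == "__main__":
    # Hypothetical feature names for illustration.
    example = [("cap-shape=bell", "i"), ("age", "q"), ("ring-count", "int")]
    print("\n".join(featmap_lines(example)))
```

Writing the returned lines to featmap.txt, one per line, gives a file that can be passed with fmap=featmap.txt.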

Monitoring progress

When you run training, messages like the following are printed:

<code class="hljs vbscript has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;">tree train end, 1 roots, 12 extra nodes, 0 pruned nodes ,max_depth=3
[0]  test-error:0.016139
boosting round 1, 0 sec elapsed
tree train end, 1 roots, 10 extra nodes, 0 pruned nodes ,max_depth=3
[1]  test-error:0.000000</code>

The evaluation messages are written to the standard error stream, stderr. To record them, redirect stderr to a file:

<code class="hljs avrasm has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;">xgboost mushroom.conf 2>log.txt</code>

log.txt will then contain:

<code class="hljs css has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;">[0]     test-error:0.016139
[1]     test-error:0.000000</code>

You can also monitor statistics on the training data and the test data at the same time, by configuring:

<code class="language-conf hljs bash has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;">eval[test] = "agaricus.txt.test"
eval[trainname] = "agaricus.txt.train"</code>

Running with this configuration produces:

<code class="hljs css has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;">[0]     test-error:0.016139     trainname-error:0.014433
[1]     test-error:0.000000     trainname-error:0.001228</code>

The rule is eval[name-printed-in-log] = filename: the file filename is added to the monitoring process, and the model is evaluated on it in every round.
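Since each evaluation line has the shape `[round]  name-metric:value ...`, the recorded log is easy to post-process. The sketch below is an illustration only: it assumes the line layout shown in the log excerpts in this article, and the function name `parse_eval_log` is invented for the example.

```python
import re
from typing import Dict, Iterable, List, Tuple

# Matches lines such as "[0]     test-error:0.016139"
ROUND_RE = re.compile(r"^\[(\d+)\]\s+(.*)$")


def parse_eval_log(lines: Iterable[str]) -> List[Tuple[int, Dict[str, float]]]:
    """Extract (round, {metric_name: value}) pairs from evaluation log lines.

    Lines that do not start with "[<number>]" are ignored.
    """
    results = []
    for line in lines:
        match = ROUND_RE.match(line.strip())
        if match is None:
            continue
        metrics = {}
        for token in match.group(2).split():
            name, _, value = token.rpartition(":")
            if not name:
                continue  # token without a "name:value" shape
            metrics[name] = float(value)
        results.append((int(match.group(1)), metrics))
    return results


if __name__ == "__main__":
    sample = [
        "[0]     test-error:0.016139     trainname-error:0.014433",
        "boosting round 1, 0 sec elapsed",
        "[1]     test-error:0.000000     trainname-error:0.001228",
    ]
    for round_no, metrics in parse_eval_log(sample):
        print(round_no, metrics)
```

The metric names in the resulting dictionaries are the names chosen in eval[...] joined with the metric, for example `trainname-error`.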

XGBoost can monitor several metrics at once. Suppose you also want to monitor the average log-likelihood of each prediction during training: just add eval_metric=logloss to the configuration file. After running again, the log will contain:

<code class="hljs css has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;">[0]     test-error:0.016139     test-negllik:0.029795   trainname-error:0.014433        trainname-negllik:0.027023
[1]     test-error:0.000000     test-negllik:0.000000   trainname-error:0.001228        trainname-negllik:0.002457</code>
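To make the metric concrete, the average negative log-likelihood for binary labels can be computed by hand as follows. This is a sketch of the standard formula, not XGBoost's own implementation; the function name and the clipping constant `eps` are assumptions of this example.

```python
import math
from typing import Sequence


def mean_neg_log_likelihood(
    labels: Sequence[int], probs: Sequence[float], eps: float = 1e-15
) -> float:
    """Average of -[y*log(p) + (1-y)*log(1-p)] over all instances.

    `labels` are 0/1 and `probs` are predicted positive-class probabilities.
    Probabilities are clipped to [eps, 1-eps] to avoid log(0).
    """
    if len(labels) != len(probs) or len(labels) == 0:
        raise ValueError("labels and probs must be non-empty and equal length")
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)


if __name__ == "__main__":
    # Hypothetical labels and predictions for illustration.
    print(mean_neg_log_likelihood([1, 0, 1], [0.9, 0.2, 0.6]))
```

Unlike the error rate, this value keeps decreasing as predictions become more confident and correct, so it can still show progress after the error rate has reached zero.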

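Saving models during training

To save a model every two rounds during training, set save_period=2. You will then see the model 0002.model in the current folder. To change the folder the models are written to, use the parameter dir=foldername. By default, XGBoost saves the model from the last round.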

Continuing from an existing model

To continue boosting from an existing model, say 0002.model, use the following command line:

<code class="hljs avrasm has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;">xgboost mushroom.conf model_in=0002.model num_round=2 model_out=continue.model</code>

XGBoost loads 0002.model, runs two more boosting rounds, and saves the resulting model to continue.model. Note that the training and evaluation data specified in mushroom.conf must not change when you do this.

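Using multiple threads

When working with a large dataset, you may want to run in parallel. If your compiler supports OpenMP, XGBoost is natively multi-threaded; set the parameter nthread=10 to use 10 threads.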
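All the commands in this article follow one pattern: the configuration file followed by key=value pairs such as nthread=10. A small wrapper can assemble such a command line from a script. This is a hedged sketch: `build_command` and `default_thread_count` are hypothetical helpers, the binary is assumed to be reachable as `xgboost`, and capping the thread count at the machine's CPU count is a choice of this example rather than an XGBoost requirement.

```python
import os
from typing import List


def build_command(conf_path: str, binary: str = "xgboost", **overrides) -> List[str]:
    """Return an argument list: binary, config file, then key=value pairs."""
    args = [binary, conf_path]
    for key, value in overrides.items():
        args.append(f"{key}={value}")
    return args


def default_thread_count(limit: int = 10) -> int:
    """Pick a thread count no larger than `limit` or the CPU count."""
    return max(1, min(limit, os.cpu_count() or 1))


if __name__ == "__main__":
    cmd = build_command("mushroom.conf", nthread=default_thread_count())
    print(" ".join(cmd))
    # To actually run it (requires the xgboost binary to be installed):
    # import subprocess; subprocess.run(cmd, check=True)
```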

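Other things to note

  • What are agaricus.txt.test.buffer and agaricus.txt.train.buffer? 
    • By default, XGBoost generates a binary cache file with the suffix buffer. The next time XGBoost runs, it loads the cache file instead of the original file.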