Introduction to Boosted Trees

  • Review of concepts of supervised learning
    • objective function
    • loss function L(Θ)
    • regularization Ω(Θ)
    • lasso
    • ridge regression
    • logistic regression
  • boosted trees
    • Regression tree (CART)
    • Regression tree ensemble
    • tree ensemble methods
    • represent the regression trees in an optimization framework
    • how to solve the optimization problem
    • additive training (boosting)
    • how do we decide which f to add
  • how to define F the space of functions containing all regression trees
    • map concept of tree to optimization
    • refine the definition of tree
    • Define the Complexity of Tree
    • regroup the objective by leaf
    • The structure score
    • EXAMPLE
    • search algorithm for single tree
    • Efficient Finding of the Best Split
    • Algorithm for Split Finding
    • What about Categorical Variables
    • Pruning and Regularization
  • Recap: Boosted Tree Algorithm
    • Question 1
    • Question 2
    • Summary

Review of concepts of supervised learning

objective function

Obj(Θ) = L(Θ) + Ω(Θ)    (2.1)

where L(Θ) is the training loss, and Ω(Θ) is the regularization term, which measures the complexity of the model.

loss function L(Θ)

  • square loss: l(y_i, ŷ_i) = (y_i − ŷ_i)²
  • logistic loss: l(y_i, ŷ_i) = y_i ln(1 + e^(−ŷ_i)) + (1 − y_i) ln(1 + e^(ŷ_i))

regularization Ω(Θ)

  • l1 norm (lasso): Ω(w) = λ‖w‖_1
  • l2 norm (ridge): Ω(w) = λ‖w‖_2²

lasso

Σ_{i=1}^n (y_i − ŷ_i)² + λ‖w‖_1

- linear model, square loss, l1 regularization

ridge regression

Σ_{i=1}^n (y_i − ŷ_i)² + λ‖w‖_2²

- linear model, square loss, l2 regularization

logistic regression

Σ_{i=1}^n [y_i ln(1 + e^(−ŷ_i)) + (1 − y_i) ln(1 + e^(ŷ_i))] + λ‖w‖_2²

- linear model, logistic loss, l2 regularization
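To make the loss-plus-regularization decomposition concrete, here is a minimal sketch of these three objectives in NumPy; the function names (ridge_objective, lasso_objective, logistic_objective) are illustrative and not part of any library:

```python
import numpy as np

def ridge_objective(w, X, y, lam):
    """Linear model + square loss + l2 regularization (ridge)."""
    y_hat = X @ w
    loss = np.sum((y - y_hat) ** 2)    # training loss L(Theta)
    reg = lam * np.sum(w ** 2)         # regularization Omega(Theta)
    return loss + reg

def lasso_objective(w, X, y, lam):
    """Linear model + square loss + l1 regularization (lasso)."""
    y_hat = X @ w
    return np.sum((y - y_hat) ** 2) + lam * np.sum(np.abs(w))

def logistic_objective(w, X, y, lam):
    """Linear model + logistic loss + l2 regularization (logistic regression).
    y is assumed to take values in {0, 1}, y_hat is the raw score."""
    y_hat = X @ w
    loss = np.sum(y * np.log1p(np.exp(-y_hat)) + (1 - y) * np.log1p(np.exp(y_hat)))
    return loss + lam * np.sum(w ** 2)
```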

boosted trees

Regression tree (CART)

A regression tree (also called a classification and regression tree, CART) stores one score in each leaf.

Regression tree ensemble

The prediction of the ensemble is the sum of the scores predicted by each of the trees.

tree ensemble methods

  • most widely used methods, such as GBM, random forest, …
  • invariant to scaling of the inputs, so there is no need to worry about feature normalization
  • learn higher-order interactions between features

represent the regression trees in an optimization framework

  • model: assuming we have K trees (see the prediction sketch after this list)

    ŷ_i = Σ_{k=1}^K f_k(x_i),   f_k ∈ F

    • F is the space of functions containing all regression trees.
    • a regression tree is a function that maps the attributes to a score.
    • each regression tree f_k(x) is in charge of some part of the attributes.
  • Parameters: the structure of each tree f_k(x), and the scores in the leaves,
    or simply Θ = {f_1, …, f_K}

  • Objective

    Obj(Θ) = Σ_{i=1}^n l(y_i, ŷ_i) + Σ_{k=1}^K Ω(f_k)    (2.2)

  • how to define Ω?
    • # nodes/depth
    • l2 norm of the leaf weights
    • ….
  • how to define the loss function?
    • square loss results in the common gradient boosting machine
    • logistic loss results in LogitBoost
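As a small sketch of the model above, assuming each tree f_k is available as a callable that maps a feature vector to its leaf score (how a single tree is represented is worked out later in this note):

```python
import numpy as np

def ensemble_predict(trees, X):
    """y_hat_i = sum_{k=1..K} f_k(x_i): each tree contributes one score,
    and the ensemble prediction is their sum.
    `trees` is a list of callables f_k(x) -> score (illustrative, not a library API)."""
    return np.array([sum(f_k(x) for f_k in trees) for x in X])
```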

how to solve the optimization problem

additive training (boosting)

At training round 0, start from a constant prediction, then add a new function each round:
ŷ_i^(0) = 0
ŷ_i^(1) = f_1(x_i) = ŷ_i^(0) + f_1(x_i)
ŷ_i^(2) = f_1(x_i) + f_2(x_i) = ŷ_i^(1) + f_2(x_i)

At training round t, keep the functions added in the previous rounds (ŷ_i^(t−1)) and add a new function f_t(x_i):
ŷ_i^(t) = Σ_{k=1}^t f_k(x_i) = ŷ_i^(t−1) + f_t(x_i)

how do we decide which f to add?

At round t, solve the following optimization problem:

min_{f_t} Obj^(t) = Σ_{i=1}^n l(y_i, ŷ_i^(t−1) + f_t(x_i)) + Σ_{i=1}^t Ω(f_i)
                  = Σ_{i=1}^n l(y_i, ŷ_i^(t−1) + f_t(x_i)) + Ω(f_t) + C    (2.3)

For the loss function:

  • square loss: Obj^(t) becomes a quadratic function of f_t; set the derivative to 0 and solve…
  • other cases: we can use a numerical approach:
    • define g_i = ∂_{ŷ^(t−1)} l(y_i, ŷ_i^(t−1)),  h_i = ∂²_{ŷ^(t−1)} l(y_i, ŷ_i^(t−1))
    • the second-order Taylor expansion of Obj^(t):
      Obj^(t) ≈ Σ_{i=1}^n [l(y_i, ŷ_i^(t−1)) + g_i f_t(x_i) + ½ h_i f_t(x_i)²] + Ω(f_t) + C
    • with constants removed, Obj^(t) becomes:
      Σ_{i=1}^n [g_i f_t(x_i) + ½ h_i f_t(x_i)²] + Ω(f_t)

So again we arrive at a quadratic convex optimization problem.
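For reference, a minimal sketch of g_i and h_i for the two losses discussed above (square loss, and logistic loss with labels in {0, 1}); the helper names are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def grad_hess_square_loss(y, y_hat):
    """g_i and h_i for l(y, y_hat) = (y - y_hat)^2."""
    g = 2.0 * (y_hat - y)           # first derivative w.r.t. y_hat
    h = 2.0 * np.ones_like(y_hat)   # second derivative w.r.t. y_hat
    return g, h

def grad_hess_logistic_loss(y, y_hat):
    """g_i and h_i for the logistic loss, with y in {0, 1} and y_hat a raw score."""
    p = sigmoid(y_hat)
    g = p - y
    h = p * (1.0 - p)
    return g, h
```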

how to define F: the space of functions containing all regression trees

map concept of tree to optimization

information gain -> training loss
pruning -> regularization defined by the number of nodes
max depth -> constraint on the function space
smoothing leaf values -> l2 regularization on the leaf weights

refine the definition of tree

  • We define a tree by a vector of scores in the leaves, and a leaf index mapping function q(x) that maps an instance to a leaf:
    f_t(x) = w_{q(x)},   w ∈ R^T,   q: R^d → {1, 2, …, T}

    • T is the number of leaves
    • q(x) gives the index of the leaf that the sample falls into, e.g. 1 or 3
    • w_j is the weight (score) of leaf j


Define the Complexity of Tree

  • Define the complexity as (this is not the only possible definition; a small sketch follows below):
    Ω(f_t) = γT + ½ λ Σ_{j=1}^T w_j²

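A toy sketch of this tree representation: a hand-written q(x), a weight vector w, and the complexity Ω(f_t). The thresholds and feature indices are made up purely for illustration, and leaves are indexed from 0 here:

```python
import numpy as np

# A toy tree with T = 3 leaves, represented exactly as in the text:
# a leaf-index mapping q(x) and a vector w of leaf scores.
def q(x):
    """Map an instance x (a feature vector) to a leaf index in {0, 1, 2}."""
    if x[0] < 0.5:
        return 0
    return 1 if x[1] < 2.0 else 2

w = np.array([0.3, -0.1, 0.8])        # one score per leaf

def f_t(x):
    return w[q(x)]                    # f_t(x) = w_{q(x)}

def tree_complexity(w, gamma, lam):
    """Omega(f_t) = gamma * T + 0.5 * lambda * sum_j w_j^2."""
    T = len(w)
    return gamma * T + 0.5 * lam * np.sum(w ** 2)
```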

regroup the objective by leaf

  • Define the instance set in leaf j as
    I_j = {i | q(x_i) = j}
  • Regroup the objective by each leaf (using f_t(x_i) = w_{q(x_i)}):
    Obj^(t) ≈ Σ_{i=1}^n [g_i f_t(x_i) + ½ h_i f_t(x_i)²] + Ω(f_t)
            = Σ_{i=1}^n [g_i w_{q(x_i)} + ½ h_i w_{q(x_i)}²] + γT + ½ λ Σ_{j=1}^T w_j²
            = Σ_{j=1}^T [(Σ_{i∈I_j} g_i) w_j + ½ (Σ_{i∈I_j} h_i + λ) w_j²] + γT

    • This is a sum of T independent quadratic functions of the leaf weights w_j

The structure score

Two facts about a single-variable quadratic function:

argmin_x [G x + ½ H x²] = −G/H,   min_x [G x + ½ H x²] = −G²/(2H)

Define G_j = Σ_{i∈I_j} g_i and H_j = Σ_{i∈I_j} h_i; then the objective becomes
Obj^(t) = Σ_{j=1}^T [(Σ_{i∈I_j} g_i) w_j + ½ (Σ_{i∈I_j} h_i + λ) w_j²] + γT = Σ_{j=1}^T [G_j w_j + ½ (H_j + λ) w_j²] + γT

  • Assuming the structure q(x) of the tree is fixed, the optimal weight of each leaf is:
    w_j* = −G_j / (H_j + λ)
  • and the resulting objective value (the structure score) is:
    Obj* = −½ Σ_{j=1}^T G_j² / (H_j + λ) + γT
    (a small sketch of this computation follows this list)
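A small sketch of the computation above, assuming the per-instance g_i, h_i and the leaf assignments q(x_i) (0-indexed here) are already available; the helper names are illustrative:

```python
import numpy as np

def leaf_stats(g, h, leaf_index, T):
    """G_j = sum of g_i and H_j = sum of h_i over the instances in leaf j."""
    G = np.zeros(T)
    H = np.zeros(T)
    for gi, hi, j in zip(g, h, leaf_index):
        G[j] += gi
        H[j] += hi
    return G, H

def optimal_weights_and_score(G, H, lam, gamma):
    """w*_j = -G_j / (H_j + lambda) and Obj* = -1/2 * sum_j G_j^2/(H_j+lambda) + gamma*T."""
    w_star = -G / (H + lam)
    obj_star = -0.5 * np.sum(G ** 2 / (H + lam)) + gamma * len(G)
    return w_star, obj_star
```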

EXAMPLE

(figure: a worked example of computing the structure score for a small tree)

search algorithm for single tree

  • enumerate the possible tree structures q(x)
  • calculate the structure score for each q(x), using the scoring function:
    Obj* = −½ Σ_{j=1}^T G_j² / (H_j + λ) + γT
  • find the best tree structure, and use the optimal leaf weights
    w_j* = −G_j / (H_j + λ)

But… there can be infinitely many possible tree structures,

so in practice we do not enumerate them but greedily grow the tree:

  • start from a tree with depth 0
  • For each leaf node of the tree, try to add a split. The change of
    the objective after adding the split is (a small gain-computation sketch follows this list):

    Gain = ½ [ G_L²/(H_L+λ) + G_R²/(H_R+λ) − (G_L+G_R)²/(H_L+H_R+λ) ] − γ

    • the score of the left/right child: G_L²/(H_L+λ) and G_R²/(H_R+λ)
    • the score if we do not split: (G_L+G_R)²/(H_L+H_R+λ)
    • the complexity cost of introducing an additional leaf: γ
  • Remaining question: how do we find the best split?
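The gain formula above translates directly into a few lines of code; this is an illustrative sketch, not a library function:

```python
def split_gain(G_L, H_L, G_R, H_R, lam, gamma):
    """Gain of splitting one leaf into a left and a right child."""
    score_left   = G_L ** 2 / (H_L + lam)
    score_right  = G_R ** 2 / (H_R + lam)
    score_parent = (G_L + G_R) ** 2 / (H_L + H_R + lam)
    return 0.5 * (score_left + score_right - score_parent) - gamma
```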

Efficient Finding of the Best Split

First, take a look at the algorithm for split finding.

Algorithm for Split Finding

  • For each node, enumerate over all features
  • For each feature, sort the instances by feature value
  • Use a linear scan to decide the best split along that feature (a sketch of this scan follows below)
  • Take the best split solution over all the features

The time complexity of growing a tree of depth K is O(n d K log n): each level needs O(n log n) time to sort, there are d features, and we need to do it for K levels.

  • This can be further optimized (e.g. use approximation or caching the sorted features)
  • Can scale to very large datasets
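Below is a sketch of the per-feature linear scan described above, under the assumption that g and h are NumPy arrays and that ties in feature values are skipped; wrapping this in a loop over all features and taking the best result gives the exact greedy split finder:

```python
import numpy as np

def best_split_on_feature(x, g, h, lam, gamma):
    """Split finding along one feature: sort the instances by feature value,
    then do a single linear scan, accumulating G_L, H_L as prefix sums of g and h."""
    order = np.argsort(x)
    g, h = g[order], h[order]
    G, H = g.sum(), h.sum()

    best_gain, best_threshold = -np.inf, None
    G_L = H_L = 0.0
    for i in range(len(x) - 1):
        G_L += g[i]
        H_L += h[i]
        G_R, H_R = G - G_L, H - H_L
        # only split between two distinct feature values
        if x[order[i]] == x[order[i + 1]]:
            continue
        gain = 0.5 * (G_L**2 / (H_L + lam) + G_R**2 / (H_R + lam)
                      - G**2 / (H + lam)) - gamma
        if gain > best_gain:
            best_gain = gain
            best_threshold = 0.5 * (x[order[i]] + x[order[i + 1]])
    return best_gain, best_threshold
```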


What about Categorical Variables?

  • Some tree learning algorithms handle categorical variables and continuous variables separately. Actually, this is not necessary.
  • We can easily use the scoring formula we derived to score splits based on categorical variables. We can encode the categorical variables into a numerical vector
    using one-hot encoding: allocate a vector of length equal to the number of categories, with
    z_j = 1 if x is in category j, and z_j = 0 otherwise
  • The vector will be sparse if there are many categories, so the learning algorithm should preferably handle sparse data (a small encoding sketch follows this list)
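A minimal dense one-hot encoding sketch (a real implementation would typically build a sparse matrix instead); the function name is illustrative:

```python
import numpy as np

def one_hot(categories):
    """Encode a categorical column into a one-hot matrix:
    z_j = 1 if the instance is in category j, 0 otherwise."""
    values = sorted(set(categories))
    index = {v: j for j, v in enumerate(values)}
    Z = np.zeros((len(categories), len(values)))
    for i, v in enumerate(categories):
        Z[i, index[v]] = 1.0
    return Z, values
```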

Pruning and Regularization

  • Pre-stopping

    • Stop splitting if the best split has negative gain
    • But maybe a split can benefit future splits…
  • Post-Pruning

    • Grow a tree to its maximum depth,
    • recursively prune out all the leaf splits with negative gain

Recap: Boosted Tree Algorithm

  • Add a new tree in each iteration
  • At the beginning of each iteration, calculate g_i, h_i
  • Use the statistics to greedily grow a tree f_t(x):
    Obj* = −½ Σ_{j=1}^T G_j² / (H_j + λ) + γT
  • Add f_t(x) to the model: ŷ_i^(t) = ŷ_i^(t−1) + f_t(x_i)
    Usually, we instead do ŷ_i^(t) = ŷ_i^(t−1) + ε f_t(x_i),
    where ε is called the step-size or shrinkage, and is usually set to around 0.1.
    This means we do not do the full optimization in each step and reserve a chance for future rounds, which helps prevent overfitting.
    (A sketch of the full training loop follows this list.)
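Putting the pieces together, a sketch of the overall loop; grad_hess and grow_tree stand for the loss-derivative and tree-growing steps described above and are assumptions of this sketch, not real library calls:

```python
import numpy as np

def boosted_tree_fit(X, y, n_rounds, grad_hess, grow_tree, eps=0.1):
    """Recap of the boosted tree algorithm (sketch).

    grad_hess(y, y_hat) -> (g, h) : first/second derivatives of the loss
    grow_tree(X, g, h)  -> f_t    : greedily grown tree, returned as a
                                    callable f_t(x) -> leaf score
    Both helpers are assumed; their names are illustrative."""
    y_hat = np.zeros(len(y))                  # start from a constant prediction
    trees = []
    for t in range(n_rounds):
        g, h = grad_hess(y, y_hat)            # statistics at the start of the round
        f_t = grow_tree(X, g, h)              # grow a tree using g, h
        trees.append(f_t)
        # shrinkage: y_hat^(t) = y_hat^(t-1) + eps * f_t(x)
        y_hat = y_hat + eps * np.array([f_t(x) for x in X])
    return trees
```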

Question 1

How can we build a boosted tree to solve a weighted regression problem, where each instance has an importance weight?

  • Define the objective, calculate g_i and h_i for it, and feed them to the same tree learning algorithm we already have for the un-weighted version (see the sketch after this list)
  • Again, think of the separation of model and objective, and how the theory can help better organize the machine learning toolkit
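For example, with the weighted square loss l_i = a_i (y_i − ŷ_i)², only g_i and h_i change and the rest of the pipeline stays untouched; a sketch, with an illustrative helper name:

```python
import numpy as np

def grad_hess_weighted_square_loss(y, y_hat, a):
    """g_i and h_i for the weighted square loss l_i = a_i * (y_i - y_hat_i)^2.
    The tree learning algorithm is unchanged: it only ever sees g and h."""
    g = 2.0 * a * (y_hat - y)
    h = 2.0 * a * np.ones_like(y_hat)
    return g, h
```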

Question 2

Back to the time series problem: if we want to learn step functions over time, are there other ways to learn the time splits besides the top-down split approach?


  • All that is important is the structure score of the splits:
    Obj* = −½ Σ_{j=1}^T G_j² / (H_j + λ) + γT

    • Top-down greedy, same as trees
    • Bottom-up greedy, start from individual points as each group, greedily merge neighbors
    • Dynamic programming, can find optimal solution for this case

Summary

• The separation between model, objective, and parameters can be helpful for understanding and customizing learning models
• The bias-variance trade-off applies everywhere, including learning in functional space
• We can be formal about what we learn and how we learn it. A clear understanding of the theory can guide a cleaner implementation.
