Introduction to Boosted Trees

  • Review of concepts of supervised learning
    • objective function
    • loss function L(Θ)
    • regularization Ω(Θ)
    • lasso
    • ridge regression
    • logistic regression
  • boosted trees
    • Regression tree (CART)
    • Regression tree ensemble
    • tree ensemble methods
    • represent the regression trees in an optimization framework
    • how to solve the optimization problem
    • additive training (boosting)
    • how do we decide which f to add
  • how to define F the space of functions containing all regression trees
    • map concept of tree to optimization
    • refine the definition of tree
    • Define the Complexity of Tree
    • regroup the objective by leaf
    • The structure score
    • EXAMPLE
    • search algorithm for single tree
    • Efficient Finding of the Best Split
    • Algorithm for Split Finding
    • What about Categorical Variables
    • Pruning and Regularization
  • Recap: Boosted Tree Algorithm
    • Question 1
    • Question 2
    • Summary

Review of concepts of supervised learning

objective function

Obj(Θ) = L(Θ) + Ω(Θ)    (2.1)

where L(Θ) is the training loss, and Ω(Θ) is the regularization term, which measures the complexity of the model.

loss function L(Θ)

  • square loss: l(y_i, ŷ_i) = (y_i − ŷ_i)²
  • logistic loss: l(y_i, ŷ_i) = y_i ln(1 + e^(−ŷ_i)) + (1 − y_i) ln(1 + e^(ŷ_i))

regularization Ω(Θ)

  • l1 norm (lasso): Ω(w) = λ‖w‖_1
  • l2 norm (ridge): Ω(w) = λ‖w‖_2²

lasso

Σ_{i=1}^n (y_i − ŷ_i)² + λ‖w‖_1

- linear model, square loss, l1 regularization

ridge regression

Σ_{i=1}^n (y_i − ŷ_i)² + λ‖w‖_2²

- linear model, square loss, l2 regularization

logistic regression

Σ_{i=1}^n [y_i ln(1 + e^(−ŷ_i)) + (1 − y_i) ln(1 + e^(ŷ_i))] + λ‖w‖_2²

- linear model, logistic loss, l2 regularization
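To make the loss-plus-regularization decomposition concrete, here is a minimal sketch of these three objectives in NumPy; the function names (ridge_objective, lasso_objective, logistic_objective) are illustrative and not part of any library:

```python
import numpy as np

def ridge_objective(w, X, y, lam):
    """Linear model + square loss + l2 regularization (ridge)."""
    y_hat = X @ w
    loss = np.sum((y - y_hat) ** 2)    # training loss L(Theta)
    reg = lam * np.sum(w ** 2)         # regularization Omega(Theta)
    return loss + reg

def lasso_objective(w, X, y, lam):
    """Linear model + square loss + l1 regularization (lasso)."""
    y_hat = X @ w
    return np.sum((y - y_hat) ** 2) + lam * np.sum(np.abs(w))

def logistic_objective(w, X, y, lam):
    """Linear model + logistic loss + l2 regularization (logistic regression).
    y is assumed to take values in {0, 1}, y_hat is the raw score."""
    y_hat = X @ w
    loss = np.sum(y * np.log1p(np.exp(-y_hat)) + (1 - y) * np.log1p(np.exp(y_hat)))
    return loss + lam * np.sum(w ** 2)
```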

boosted trees

Regression tree (CART)

A regression tree (also called a classification and regression tree, CART) stores one score in each leaf.

Regression tree ensemble

The prediction of the ensemble is the sum of the scores predicted by each of the trees.

tree ensemble methods

  • most widely used methods, such as GBM, random forest, …
  • invariant to scaling of the inputs, so there is no need to worry about feature normalization
  • learn higher-order interactions between features

represent the regression trees in an optimization framework

  • model: assuming we have K trees (see the prediction sketch after this list)

    ŷ_i = Σ_{k=1}^K f_k(x_i),   f_k ∈ F

    • F is the space of functions containing all regression trees.
    • a regression tree is a function that maps the attributes to a score.
    • each regression tree f_k(x) is in charge of some part of the attributes.
  • Parameters: the structure of each tree f_k(x), and the scores in the leaves,
    or simply Θ = {f_1, …, f_K}

  • Objective

    Obj(Θ) = Σ_{i=1}^n l(y_i, ŷ_i) + Σ_{k=1}^K Ω(f_k)    (2.2)

  • how to define Ω?
    • # nodes/depth
    • l2 norm of the leaf weights
    • ….
  • how to define the loss function?
    • square loss results in the common gradient boosting machine
    • logistic loss results in LogitBoost
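As a small sketch of the model above, assuming each tree f_k is available as a callable that maps a feature vector to its leaf score (how a single tree is represented is worked out later in this note):

```python
import numpy as np

def ensemble_predict(trees, X):
    """y_hat_i = sum_{k=1..K} f_k(x_i): each tree contributes one score,
    and the ensemble prediction is their sum.
    `trees` is a list of callables f_k(x) -> score (illustrative, not a library API)."""
    return np.array([sum(f_k(x) for f_k in trees) for x in X])
```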

how to solve the optimization problem

additive training (boosting)

At training round 0, start from a constant prediction, then add a new function each round:
ŷ_i^(0) = 0
ŷ_i^(1) = f_1(x_i) = ŷ_i^(0) + f_1(x_i)
ŷ_i^(2) = f_1(x_i) + f_2(x_i) = ŷ_i^(1) + f_2(x_i)

At training round t, keep the functions added in the previous rounds (ŷ_i^(t−1)) and add a new function f_t(x_i):
ŷ_i^(t) = Σ_{k=1}^t f_k(x_i) = ŷ_i^(t−1) + f_t(x_i)

how do we decide which f to add?

At round t, solve the following optimization problem:

min_{f_t} Obj^(t) = Σ_{i=1}^n l(y_i, ŷ_i^(t−1) + f_t(x_i)) + Σ_{i=1}^t Ω(f_i)
                  = Σ_{i=1}^n l(y_i, ŷ_i^(t−1) + f_t(x_i)) + Ω(f_t) + C    (2.3)

For the loss function:

  • square loss: Obj^(t) becomes a quadratic function of f_t; set the derivative to 0 and solve…
  • other cases: we can use a numerical approach:
    • define g_i = ∂_{ŷ^(t−1)} l(y_i, ŷ_i^(t−1)),  h_i = ∂²_{ŷ^(t−1)} l(y_i, ŷ_i^(t−1))
    • the second-order Taylor expansion of Obj^(t):
      Obj^(t) ≈ Σ_{i=1}^n [l(y_i, ŷ_i^(t−1)) + g_i f_t(x_i) + ½ h_i f_t(x_i)²] + Ω(f_t) + C
    • with constants removed, Obj^(t) becomes:
      Σ_{i=1}^n [g_i f_t(x_i) + ½ h_i f_t(x_i)²] + Ω(f_t)

So again we arrive at a quadratic convex optimization problem.
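For reference, a minimal sketch of g_i and h_i for the two losses discussed above (square loss, and logistic loss with labels in {0, 1}); the helper names are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def grad_hess_square_loss(y, y_hat):
    """g_i and h_i for l(y, y_hat) = (y - y_hat)^2."""
    g = 2.0 * (y_hat - y)           # first derivative w.r.t. y_hat
    h = 2.0 * np.ones_like(y_hat)   # second derivative w.r.t. y_hat
    return g, h

def grad_hess_logistic_loss(y, y_hat):
    """g_i and h_i for the logistic loss, with y in {0, 1} and y_hat a raw score."""
    p = sigmoid(y_hat)
    g = p - y
    h = p * (1.0 - p)
    return g, h
```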

how to define F: the space of functions containing all regression trees

map concept of tree to optimization

information gain -> training loss
pruning -> regularization defined by the number of nodes
max depth -> constraint on the function space
smoothing leaf values -> l2 regularization on the leaf weights

refine the definition of tree

  • We define a tree by a vector of scores in the leaves, and a leaf index mapping function q(x) that maps an instance to a leaf:
    f_t(x) = w_{q(x)},   w ∈ R^T,   q: R^d → {1, 2, …, T}

    • T is the number of leaves
    • q(x) gives the index of the leaf that the sample falls into, e.g. 1 or 3
    • w_j is the weight (score) of leaf j


Define the Complexity of Tree

  • Define the complexity as (this is not the only possible definition; a small sketch follows below):
    Ω(f_t) = γT + ½ λ Σ_{j=1}^T w_j²

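A toy sketch of this tree representation: a hand-written q(x), a weight vector w, and the complexity Ω(f_t). The thresholds and feature indices are made up purely for illustration, and leaves are indexed from 0 here:

```python
import numpy as np

# A toy tree with T = 3 leaves, represented exactly as in the text:
# a leaf-index mapping q(x) and a vector w of leaf scores.
def q(x):
    """Map an instance x (a feature vector) to a leaf index in {0, 1, 2}."""
    if x[0] < 0.5:
        return 0
    return 1 if x[1] < 2.0 else 2

w = np.array([0.3, -0.1, 0.8])        # one score per leaf

def f_t(x):
    return w[q(x)]                    # f_t(x) = w_{q(x)}

def tree_complexity(w, gamma, lam):
    """Omega(f_t) = gamma * T + 0.5 * lambda * sum_j w_j^2."""
    T = len(w)
    return gamma * T + 0.5 * lam * np.sum(w ** 2)
```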

regroup the objective by leaf

  • Define the instance set in leaf j as
    I_j = {i | q(x_i) = j}
  • Regroup the objective by each leaf (using f_t(x_i) = w_{q(x_i)}):
    Obj^(t) ≈ Σ_{i=1}^n [g_i f_t(x_i) + ½ h_i f_t(x_i)²] + Ω(f_t)
            = Σ_{i=1}^n [g_i w_{q(x_i)} + ½ h_i w_{q(x_i)}²] + γT + ½ λ Σ_{j=1}^T w_j²
            = Σ_{j=1}^T [(Σ_{i∈I_j} g_i) w_j + ½ (Σ_{i∈I_j} h_i + λ) w_j²] + γT

    • This is a sum of T independent quadratic functions of the leaf weights w_j

The structure score

Two facts about a single-variable quadratic function:

argmin_x [G x + ½ H x²] = −G/H,   min_x [G x + ½ H x²] = −G²/(2H)

Define G_j = Σ_{i∈I_j} g_i and H_j = Σ_{i∈I_j} h_i; then the objective becomes
Obj^(t) = Σ_{j=1}^T [(Σ_{i∈I_j} g_i) w_j + ½ (Σ_{i∈I_j} h_i + λ) w_j²] + γT = Σ_{j=1}^T [G_j w_j + ½ (H_j + λ) w_j²] + γT

  • Assuming the structure q(x) of the tree is fixed, the optimal weight of each leaf is:
    w_j* = −G_j / (H_j + λ)
  • and the resulting objective value (the structure score) is:
    Obj* = −½ Σ_{j=1}^T G_j² / (H_j + λ) + γT
    (a small sketch of this computation follows this list)
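A small sketch of the computation above, assuming the per-instance g_i, h_i and the leaf assignments q(x_i) (0-indexed here) are already available; the helper names are illustrative:

```python
import numpy as np

def leaf_stats(g, h, leaf_index, T):
    """G_j = sum of g_i and H_j = sum of h_i over the instances in leaf j."""
    G = np.zeros(T)
    H = np.zeros(T)
    for gi, hi, j in zip(g, h, leaf_index):
        G[j] += gi
        H[j] += hi
    return G, H

def optimal_weights_and_score(G, H, lam, gamma):
    """w*_j = -G_j / (H_j + lambda) and Obj* = -1/2 * sum_j G_j^2/(H_j+lambda) + gamma*T."""
    w_star = -G / (H + lam)
    obj_star = -0.5 * np.sum(G ** 2 / (H + lam)) + gamma * len(G)
    return w_star, obj_star
```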

EXAMPLE

(figure: a worked example of computing the structure score for a small tree)

search algorithm for single tree

  • enumerate the possible tree structures q(x)
  • calculate the structure score for each q(x), using the scoring function:
    Obj* = −½ Σ_{j=1}^T G_j² / (H_j + λ) + γT
  • find the best tree structure, and use the optimal leaf weights
    w_j* = −G_j / (H_j + λ)

But… there can be infinitely many possible tree structures,

so in practice we do not enumerate them but greedily grow the tree:

  • start from a tree with depth 0
  • For each leaf node of the tree, try to add a split. The change of
    the objective after adding the split is (a small gain-computation sketch follows this list):

    Gain = ½ [ G_L²/(H_L+λ) + G_R²/(H_R+λ) − (G_L+G_R)²/(H_L+H_R+λ) ] − γ

    • the score of the left/right child: G_L²/(H_L+λ) and G_R²/(H_R+λ)
    • the score if we do not split: (G_L+G_R)²/(H_L+H_R+λ)
    • the complexity cost of introducing an additional leaf: γ
  • Remaining question: how do we find the best split?
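The gain formula above translates directly into a few lines of code; this is an illustrative sketch, not a library function:

```python
def split_gain(G_L, H_L, G_R, H_R, lam, gamma):
    """Gain of splitting one leaf into a left and a right child."""
    score_left   = G_L ** 2 / (H_L + lam)
    score_right  = G_R ** 2 / (H_R + lam)
    score_parent = (G_L + G_R) ** 2 / (H_L + H_R + lam)
    return 0.5 * (score_left + score_right - score_parent) - gamma
```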

Efficient Finding of the Best Split

First, take a look at the algorithm for split finding.

Algorithm for Split Finding

  • For each node, enumerate over all features
  • For each feature, sort the instances by feature value
  • Use a linear scan to decide the best split along that feature (a sketch of this scan follows below)
  • Take the best split solution over all the features

The time complexity of growing a tree of depth K is O(n d K log n): each level needs O(n log n) time to sort, there are d features, and we need to do it for K levels.

  • This can be further optimized (e.g. use approximation or caching the sorted features)
  • Can scale to very large datasets
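Below is a sketch of the per-feature linear scan described above, under the assumption that g and h are NumPy arrays and that ties in feature values are skipped; wrapping this in a loop over all features and taking the best result gives the exact greedy split finder:

```python
import numpy as np

def best_split_on_feature(x, g, h, lam, gamma):
    """Split finding along one feature: sort the instances by feature value,
    then do a single linear scan, accumulating G_L, H_L as prefix sums of g and h."""
    order = np.argsort(x)
    g, h = g[order], h[order]
    G, H = g.sum(), h.sum()

    best_gain, best_threshold = -np.inf, None
    G_L = H_L = 0.0
    for i in range(len(x) - 1):
        G_L += g[i]
        H_L += h[i]
        G_R, H_R = G - G_L, H - H_L
        # only split between two distinct feature values
        if x[order[i]] == x[order[i + 1]]:
            continue
        gain = 0.5 * (G_L**2 / (H_L + lam) + G_R**2 / (H_R + lam)
                      - G**2 / (H + lam)) - gamma
        if gain > best_gain:
            best_gain = gain
            best_threshold = 0.5 * (x[order[i]] + x[order[i + 1]])
    return best_gain, best_threshold
```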


What about Categorical Variables?

  • Some tree learning algorithms handle categorical variables and continuous variables separately. Actually, this is not necessary.
  • We can easily use the scoring formula we derived to score splits based on categorical variables. We can encode the categorical variables into a numerical vector
    using one-hot encoding: allocate a vector of length equal to the number of categories, with
    z_j = 1 if x is in category j, and z_j = 0 otherwise
  • The vector will be sparse if there are many categories, so the learning algorithm should preferably handle sparse data (a small encoding sketch follows this list)
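A minimal dense one-hot encoding sketch (a real implementation would typically build a sparse matrix instead); the function name is illustrative:

```python
import numpy as np

def one_hot(categories):
    """Encode a categorical column into a one-hot matrix:
    z_j = 1 if the instance is in category j, 0 otherwise."""
    values = sorted(set(categories))
    index = {v: j for j, v in enumerate(values)}
    Z = np.zeros((len(categories), len(values)))
    for i, v in enumerate(categories):
        Z[i, index[v]] = 1.0
    return Z, values
```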

Pruning and Regularization

  • Pre-stopping

    • Stop splitting if the best split has negative gain
    • But maybe a split can benefit future splits…
  • Post-Pruning

    • Grow a tree to its maximum depth,
    • recursively prune out all the leaf splits with negative gain

Recap: Boosted Tree Algorithm

  • Add a new tree in each iteration
  • At the beginning of each iteration, calculate g_i, h_i
  • Use the statistics to greedily grow a tree f_t(x):
    Obj* = −½ Σ_{j=1}^T G_j² / (H_j + λ) + γT
  • Add f_t(x) to the model: ŷ_i^(t) = ŷ_i^(t−1) + f_t(x_i)
    Usually, we instead do ŷ_i^(t) = ŷ_i^(t−1) + ε f_t(x_i),
    where ε is called the step-size or shrinkage, and is usually set to around 0.1.
    This means we do not do the full optimization in each step and reserve a chance for future rounds, which helps prevent overfitting.
    (A sketch of the full training loop follows this list.)
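Putting the pieces together, a sketch of the overall loop; grad_hess and grow_tree stand for the loss-derivative and tree-growing steps described above and are assumptions of this sketch, not real library calls:

```python
import numpy as np

def boosted_tree_fit(X, y, n_rounds, grad_hess, grow_tree, eps=0.1):
    """Recap of the boosted tree algorithm (sketch).

    grad_hess(y, y_hat) -> (g, h) : first/second derivatives of the loss
    grow_tree(X, g, h)  -> f_t    : greedily grown tree, returned as a
                                    callable f_t(x) -> leaf score
    Both helpers are assumed; their names are illustrative."""
    y_hat = np.zeros(len(y))                  # start from a constant prediction
    trees = []
    for t in range(n_rounds):
        g, h = grad_hess(y, y_hat)            # statistics at the start of the round
        f_t = grow_tree(X, g, h)              # grow a tree using g, h
        trees.append(f_t)
        # shrinkage: y_hat^(t) = y_hat^(t-1) + eps * f_t(x)
        y_hat = y_hat + eps * np.array([f_t(x) for x in X])
    return trees
```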

Question 1

How can we build a boosted tree to solve a weighted regression problem, where each instance has an importance weight?

  • Define the objective, calculate g_i and h_i for it, and feed them to the same tree learning algorithm we already have for the un-weighted version (see the sketch after this list)
  • Again, think of the separation of model and objective, and how the theory can help better organize the machine learning toolkit
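For example, with the weighted square loss l_i = a_i (y_i − ŷ_i)², only g_i and h_i change and the rest of the pipeline stays untouched; a sketch, with an illustrative helper name:

```python
import numpy as np

def grad_hess_weighted_square_loss(y, y_hat, a):
    """g_i and h_i for the weighted square loss l_i = a_i * (y_i - y_hat_i)^2.
    The tree learning algorithm is unchanged: it only ever sees g and h."""
    g = 2.0 * a * (y_hat - y)
    h = 2.0 * a * np.ones_like(y_hat)
    return g, h
```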

Question 2

Back to the time series problem: if we want to learn step functions over time, are there other ways to learn the time splits besides the top-down split approach?


  • All that is important is the structure score of the splits:
    Obj* = −½ Σ_{j=1}^T G_j² / (H_j + λ) + γT

    • Top-down greedy, same as trees
    • Bottom-up greedy, start from individual points as each group, greedily merge neighbors
    • Dynamic programming, can find optimal solution for this case

Summary

• The separation between model, objective, and parameters can be helpful for understanding and customizing learning models
• The bias-variance trade-off applies everywhere, including learning in functional space
• We can be formal about what we learn and how we learn it. A clear understanding of the theory can guide a cleaner implementation.
