Introduction to Boosted Trees
- Review of concepts of supervised learning
- objective function
- loss function L(Θ)
- regularization Ω(Θ)
- lasso
- ridge regression
- logistic regression
- boosted trees
- Regression tree (CART)
- Regression tree ensemble
- tree ensemble methods
- represent the regression tree in the optimization framework
- how to solve the optimization problem
- additive training (boosting)
- how we decide which f to add
- how to define F: the space of functions containing all regression trees
- map concepts of tree learning to optimization
- refine the definition of a tree
- Define the Complexity of a Tree
- revisit the objective with the instance set in leaf j
- The structure score
- EXAMPLE
- search algorithm for a single tree
- Efficient Finding of the Best Split
- Algorithm for Split Finding
- What about Categorical Variables
- Pruning and Regularization
- Recap: Boosted Tree Algorithm
- Questions 1
- Questions 2
- Summary
Review of concepts of supervised learning
objective function
$$\mathrm{Obj}(\Theta) = L(\Theta) + \Omega(\Theta)$$

where $L(\Theta)$ is the training loss, measuring how well the model fits the training data, and $\Omega(\Theta)$ is the regularization term, measuring the complexity of the model.
loss function L(Θ)

- square loss: $l(y_i, \hat{y}_i) = (y_i - \hat{y}_i)^2$
- logistic loss: $l(y_i, \hat{y}_i) = y_i \ln(1 + e^{-\hat{y}_i}) + (1 - y_i) \ln(1 + e^{\hat{y}_i})$
regularization Ω(Θ)

- l1 norm (lasso): $\Omega(w) = \lambda \|w\|_1$
- l2 norm (ridge): $\Omega(w) = \lambda \|w\|_2^2$
lasso

- linear model, square loss, l1 regularization

ridge regression

- linear model, square loss, l2 regularization

logistic regression

- linear model, logistic loss, l2 regularization
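For concreteness, here is a minimal sketch of these building blocks in Python. The function names are my own, and the ridge objective in the closing comment is just one of the combinations listed above:

```python
import numpy as np

def square_loss(y, y_hat):
    return (y - y_hat) ** 2

def logistic_loss(y, y_hat):
    # y in {0, 1}; y_hat is the raw (pre-sigmoid) prediction
    return y * np.log1p(np.exp(-y_hat)) + (1 - y) * np.log1p(np.exp(y_hat))

def l1_penalty(w, lam):   # lasso
    return lam * np.abs(w).sum()

def l2_penalty(w, lam):   # ridge
    return lam * (w ** 2).sum()

# e.g. the ridge regression objective for a linear model w on data (X, y):
# Obj(w) = square_loss(y, X @ w).sum() + l2_penalty(w, lam)
```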
boosted trees
Regression tree (CART)

A regression tree (also known as a classification and regression tree, CART) stores one score in each leaf.
Regression tree ensemble

The prediction for each sample is the sum of the scores predicted by each of the trees.
tree ensemble methods

- very widely used: GBM, random forest, …
- invariant to scaling of inputs, so there is no need to worry about feature normalization
- can learn higher-order interactions between features
represent the regression tree in the optimization framework

Model: assuming we have K trees,

$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \qquad f_k \in \mathcal{F}$$

where $\mathcal{F}$ is the space of functions containing all regression trees.

- a regression tree is a function that maps the attributes to a score
- each regression tree $f_k(x)$ is in charge of some part of the attribute space

Parameters: the structure of each tree $f_k$ and the scores in its leaves, or simply $\Theta = \{f_1, \ldots, f_K\}$.

Objective:

$$\mathrm{Obj}(\Theta) = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k)$$

- how to define Ω?
  - number of nodes / depth
  - l2 norm of the leaf weights
  - …
- how to define the loss function l?
  - square loss results in the common gradient boosted machine
  - logistic loss results in LogitBoost
how to solve the optimization problem
additive training (boosting)
Start from a constant prediction at round 0, and add one new function each time:

$$\hat{y}_i^{(0)} = 0, \qquad \hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(x_i) = \sum_{k=1}^{t} f_k(x_i)$$

At training round t, keep the functions added in previous rounds fixed and learn only the new function $f_t$.
how we decide which f to add?
At round t, solve the optimization problem:

$$\mathrm{Obj}^{(t)} = \sum_{i=1}^{n} l\big(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\big) + \Omega(f_t) + \mathrm{const}$$

Consider the loss function:

- square loss: $\mathrm{Obj}^{(t)}$ becomes a quadratic function of $f_t$; set the derivative to zero and solve.
- other cases: we can use the numeric way, a second-order Taylor expansion. Define

  $$g_i = \partial_{\hat{y}^{(t-1)}}\, l(y_i, \hat{y}_i^{(t-1)}), \qquad h_i = \partial^2_{\hat{y}^{(t-1)}}\, l(y_i, \hat{y}_i^{(t-1)})$$

  The Taylor expansion of $\mathrm{Obj}^{(t)}$ is

  $$\mathrm{Obj}^{(t)} \approx \sum_{i=1}^{n} \Big[ l(y_i, \hat{y}_i^{(t-1)}) + g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^2(x_i) \Big] + \Omega(f_t) + C$$

  With the constants removed, $\mathrm{Obj}^{(t)}$ becomes

  $$\sum_{i=1}^{n} \Big[ g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^2(x_i) \Big] + \Omega(f_t)$$

So we again arrive at a quadratic convex optimization problem, defined entirely in terms of the per-instance statistics $g_i$ and $h_i$; a short sketch of computing them follows.
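These statistics have closed forms for the two losses introduced earlier. A minimal sketch (my own code, not XGBoost's internals) that returns $g_i$ and $h_i$ at the current prediction $\hat{y}^{(t-1)}$:

```python
import numpy as np

def grad_hess_square(y, y_hat):
    # l = (y - y_hat)^2  =>  g_i = 2*(y_hat - y),  h_i = 2
    return 2.0 * (y_hat - y), 2.0 * np.ones_like(y_hat)

def grad_hess_logistic(y, y_hat):
    # l = y*ln(1+e^-y_hat) + (1-y)*ln(1+e^y_hat), with y in {0, 1}
    # =>  g_i = sigmoid(y_hat) - y,  h_i = sigmoid(y_hat)*(1 - sigmoid(y_hat))
    p = 1.0 / (1.0 + np.exp(-y_hat))
    return p - y, p * (1.0 - p)
```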
how to define F: the space of functions containing all regression trees
map concepts of tree learning to optimization

- information gain -> training loss
- pruning -> regularization defined by the number of nodes
- max depth -> constraint on the function space
- smoothing leaf values -> l2 regularization on leaf weights
refine the definition of tree
- We define tree by a vector of scores in leafs, and a leaf index mapping function
q(x) that maps an instance to a leafft(x)=wq(x),w∈RT,q:Rd→{1,2,…,T} - T is # leaf
- q(x) means that sample falls into which leaf,such as 1,3
wy means the weight of leaf y
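A toy illustration of this definition, with a hand-written structure function q; the splits and scores here are hypothetical, loosely echoing the example in the original slides:

```python
import numpy as np

# f_t(x) = w_{q(x)}: q maps an instance to a leaf index, w holds one
# score per leaf. T = 3 leaves with invented scores.
w = np.array([2.0, 0.1, -1.0])

def q(x):
    if x["age"] < 15:
        return 0 if x["is_male"] else 1
    return 2

def f_t(x):
    return w[q(x)]

print(f_t({"age": 10, "is_male": True}))   # falls in leaf 0 -> 2.0
```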
Define the Complexity of a Tree

Define the complexity as (this is not the only possible definition):

$$\Omega(f_t) = \gamma T + \tfrac{1}{2} \lambda \sum_{j=1}^{T} w_j^2$$
revisit the objective with the instance set in leaf j

Define the instance set in leaf j as $I_j = \{\, i \mid q(x_i) = j \,\}$, and regroup the objective by leaf:

$$\mathrm{Obj}^{(t)} \approx \sum_{i=1}^{n} \Big[ g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^2(x_i) \Big] + \Omega(f_t)
= \sum_{i=1}^{n} \Big[ g_i w_{q(x_i)} + \tfrac{1}{2} h_i w_{q(x_i)}^2 \Big] + \gamma T + \tfrac{1}{2} \lambda \sum_{j=1}^{T} w_j^2
= \sum_{j=1}^{T} \Big[ \big( \textstyle\sum_{i \in I_j} g_i \big) w_j + \tfrac{1}{2} \big( \textstyle\sum_{i \in I_j} h_i + \lambda \big) w_j^2 \Big] + \gamma T$$

This is a sum of T independent quadratic functions, one per leaf.
The structure score
Two facts about a single-variable quadratic function (for $H > 0$):

$$\arg\min_x \Big[ Gx + \tfrac{1}{2} H x^2 \Big] = -\frac{G}{H}, \qquad \min_x \Big[ Gx + \tfrac{1}{2} H x^2 \Big] = -\frac{1}{2} \frac{G^2}{H}$$

Define $G_j = \sum_{i \in I_j} g_i$ and $H_j = \sum_{i \in I_j} h_i$. Assume the structure of the tree, $q(x)$, is fixed. Then the optimal weight in each leaf and the resulting objective value are:

$$w_j^* = -\frac{G_j}{H_j + \lambda}, \qquad \mathrm{Obj} = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T$$
EXAMPLE
The original slides illustrate this with a small figure; the numeric sketch below carries out the same computation.
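A minimal sketch of the structure score; all the numbers here are made up for illustration:

```python
import numpy as np

# Five instances with gradient statistics g_i, h_i, each assigned to one
# of T = 3 leaves by q(x).
g = np.array([0.3, -0.6, 0.9, -0.2, 0.5])
h = np.array([1.0, 1.0, 1.0, 1.0, 1.0])
leaf = np.array([0, 0, 1, 2, 2])
T, lam, gamma = 3, 1.0, 0.5

G = np.array([g[leaf == j].sum() for j in range(T)])  # G_j per leaf
H = np.array([h[leaf == j].sum() for j in range(T)])  # H_j per leaf

w_star = -G / (H + lam)                               # optimal leaf weights
obj = -0.5 * (G ** 2 / (H + lam)).sum() + gamma * T   # structure score
print(w_star, obj)
```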
search algorithm for a single tree

- enumerate the possible tree structures q(x)
- calculate the structure score for each q(x), using the scoring function
  $$\mathrm{Obj} = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T$$
- find the best tree structure, and use the optimal leaf weights $w_j^* = -\frac{G_j}{H_j + \lambda}$

But… there can be infinitely many possible tree structures, so in practice we do not enumerate; instead we greedily grow the tree:
- start from a tree with depth 0
- for each leaf node of the tree, try to add a split. The change of the objective after adding the split is
  $$\mathrm{Gain} = \frac{1}{2} \left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma$$
  where
  - $\frac{G_L^2}{H_L + \lambda}$ and $\frac{G_R^2}{H_R + \lambda}$ are the scores of the left and right child
  - $\frac{(G_L + G_R)^2}{H_L + H_R + \lambda}$ is the score if we do not split
  - $\gamma$ is the complexity cost of introducing an additional leaf
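A small helper implementing this gain; a sketch, with the ½ factor matching the derivation above:

```python
def split_gain(G_L, H_L, G_R, H_R, lam, gamma):
    """Objective reduction from splitting one leaf into (L, R) children;
    a positive value means the split improves the regularized objective."""
    def score(G, H):
        return G * G / (H + lam)
    unsplit = score(G_L + G_R, H_L + H_R)
    return 0.5 * (score(G_L, H_L) + score(G_R, H_R) - unsplit) - gamma
```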
Remaining question: how do we find the best split?
Efficient Finding of the Best Split
First, take a look at the algorithm for split finding.
Algorithm for Split Finding
- For each node, enumerate over all features
- For each feature, sort the instances by feature value
- Use a linear scan to decide the best split along that feature
- Take the best split solution over all the features

The time complexity of growing a tree of depth K is O(n d K log n): for each level we need O(n log n) time to sort along one feature, there are d features, and we do it for K levels.

- This can be further optimized (e.g. using approximation or caching the sorted features)
- It can scale to very large datasets
(Aside on the O-notation used above: let f and g be two functions defined on some subset of the real numbers. One writes f(x) = O(g(x)) if there are constants M and x₀ such that |f(x)| ≤ M|g(x)| for all x > x₀.)
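A minimal sketch of the per-feature linear scan described above, reusing split_gain from the earlier sketch; real implementations add sparsity handling, approximation, and caching:

```python
import numpy as np

def best_split_on_feature(x, g, h, lam, gamma):
    """Linear scan over one feature at one node: sort by feature value,
    sweep a growing prefix (the left child) and track the best gain."""
    order = np.argsort(x)
    x, g, h = x[order], g[order], h[order]
    G, H = g.sum(), h.sum()
    G_L = H_L = 0.0
    best_gain, best_threshold = 0.0, None
    for i in range(len(x) - 1):
        G_L += g[i]; H_L += h[i]
        if x[i] == x[i + 1]:        # cannot separate equal feature values
            continue
        G_R, H_R = G - G_L, H - H_L
        gain = split_gain(G_L, H_L, G_R, H_R, lam, gamma)
        if gain > best_gain:
            best_gain = gain
            best_threshold = 0.5 * (x[i] + x[i + 1])
    return best_gain, best_threshold
```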
What about Categorical Variables?
- Some tree learning algorithms handle categorical and continuous variables separately. This is actually not necessary.
- We can easily use the scoring formula we derived to score splits based on categorical variables. Encode the categorical variable into a numerical vector using one-hot encoding, allocating a vector whose length is the number of categories:
  $$z_j = \begin{cases} 1 & \text{if } x \text{ is in category } j \\ 0 & \text{otherwise} \end{cases}$$
- The vector will be sparse if there are many categories, so the learning algorithm should preferably handle sparse data.
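A sketch of the one-hot encoding step, dense here for brevity; with many categories a sparse matrix would be preferred:

```python
import numpy as np

def one_hot(values):
    """Map a categorical column to a 0/1 matrix with one column per
    category, so the ordinary numeric split search applies to it."""
    cats = sorted(set(values))
    col = {c: j for j, c in enumerate(cats)}
    Z = np.zeros((len(values), len(cats)))
    for i, v in enumerate(values):
        Z[i, col[v]] = 1.0
    return Z, cats

Z, cats = one_hot(["red", "blue", "red", "green"])
```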
Pruning and Regularization
Pre-stopping

- Stop splitting if the best split has negative gain
- But maybe a split can benefit future splits…

Post-pruning

- Grow a tree to its maximum depth
- Recursively prune away the leaf splits with negative gain
Recap: Boosted Tree Algorithm
- Add a new tree in each iteration
- At the beginning of each iteration, calculate the gradient statistics $g_i$, $h_i$
- Use the statistics to greedily grow a tree $f_t(x)$ that optimizes
  $$\mathrm{Obj} = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T$$
- Add $f_t(x)$ to the model: $\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(x_i)$

Usually, instead, we do $\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + \epsilon f_t(x_i)$, where ε is called the step-size or shrinkage, usually set to around 0.1. This means we do not do the full optimization in each step, reserving the chance for future rounds; it helps prevent overfitting. The loop below sketches these steps end to end.
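A high-level sketch of the recap; grow_tree is a hypothetical helper standing in for the greedy split search of the earlier sections, and grad_hess_square is from the sketch above:

```python
import numpy as np

def boost(X, y, n_rounds, eps=0.1, lam=1.0, gamma=0.0):
    trees = []
    y_hat = np.zeros(len(y))                   # round-0 constant prediction
    for t in range(n_rounds):
        g, h = grad_hess_square(y, y_hat)      # statistics for this round
        tree = grow_tree(X, g, h, lam, gamma)  # hypothetical greedy grower
        y_hat = y_hat + eps * tree.predict(X)  # shrinkage step
        trees.append(tree)
    return trees
```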
Questions 1
How can we build a boosted tree learner for a weighted regression problem, where each instance has an importance weight?

- Define the objective, calculate $g_i$ and $h_i$, and feed them to the same tree learning algorithm we already have for the unweighted version (a worked case follows)
- Again, think of the separation of model and objective: the theory can help us organize the machine learning toolkit more cleanly
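For instance, with a weighted square loss (a standard choice; this derivation follows the unweighted case above), the instance weight $a_i$ simply flows into the gradient statistics:

$$l(y_i, \hat{y}_i) = \tfrac{1}{2} a_i (y_i - \hat{y}_i)^2 \quad\Longrightarrow\quad g_i = a_i (\hat{y}_i - y_i), \qquad h_i = a_i$$

The tree learner itself is untouched: it only ever sees $g_i$ and $h_i$.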
Questions 2
Back to the time series problem: if I want to learn step functions over time, are there ways to learn the time splits other than the top-down split approach?

- All that matters is the structure score of the splits:
  $$\mathrm{Obj} = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T$$
- Top-down greedy, same as trees
- Bottom-up greedy: start from individual points, each as its own group, and greedily merge neighbors
- Dynamic programming: can find the optimal solution for this case (see the sketch below)
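A sketch of the dynamic-programming variant; this O(n²) formulation is my own, assuming points are sorted by time and each contiguous group is scored like a leaf:

```python
import numpy as np

def dp_time_splits(g, h, lam, gamma):
    """Optimal grouping of time-ordered points into contiguous segments,
    where each segment contributes -G^2 / (2*(H + lam)) + gamma."""
    n = len(g)
    Gp = np.concatenate(([0.0], np.cumsum(g)))   # prefix sums of g
    Hp = np.concatenate(([0.0], np.cumsum(h)))   # prefix sums of h

    def seg_cost(a, b):                          # objective of segment [a, b)
        G, H = Gp[b] - Gp[a], Hp[b] - Hp[a]
        return -0.5 * G * G / (H + lam) + gamma

    best = np.full(n + 1, np.inf); best[0] = 0.0
    prev = np.zeros(n + 1, dtype=int)            # backpointers for the cuts
    for b in range(1, n + 1):
        for a in range(b):
            c = best[a] + seg_cost(a, b)
            if c < best[b]:
                best[b], prev[b] = c, a
    return best[n], prev                         # optimal objective + cuts
```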
Summary
• The separation between model, objective, and parameters can help us understand and customize learning models
• The bias-variance trade-off applies everywhere, including learning in functional space
• We can be formal about what we learn and how we learn; a clear understanding of the theory can guide a cleaner implementation