Multi-task Learning
Based on Supervised Learning, Lecture 8
- Multi-task learning
- Mathematical formulation
- Linear MTL
- Regularisers for linear MTL
  - Quadratic regulariser
  - Structured sparsity
  - Clustered MTL
- Further topics
  - Transferring to new tasks
  - Case of the variance regulariser
  - Informal reasoning
- Take home message
Multi-task learning
- Multi-task learning (MTL) is an approach to machine learning that learns a problem together with other related problems at the same time, using a shared representation. [1]
- The goal of MTL is to improve the performance of learning algorithms by learning classifiers for multiple tasks jointly.
- Typical scenario: many tasks but only a few examples per task. If $n < d$ we do not have enough data to learn the tasks one by one. However, if the tasks are related and the set $S$ or the associated regulariser captures such relationships in a simple way, learning the tasks jointly greatly improves over independent task learning (ITL).
- When problems (tasks) are closely related, learning them in parallel can be more efficient than learning each task independently. It also often leads to a better model for the main task, because the learner can exploit the commonality among the tasks.
- Applications: learning a set of linear classifiers for related objects (cars, lorries, bicycles), user modelling, multiple object detection in scenes, affective computing, bioinformatics, health informatics, marketing science, neuroimaging, NLP, speech, …
- Further categorisation is possible, e.g. hierarchical models, clustering of tasks.
- The ideas extend to non-linear cases via reproducing kernel Hilbert spaces (RKHS).
Mathematical formulation
- Fix probability measures $\mu_1,\dots,\mu_T$ on $\mathbb{R}^d \times \mathbb{R}$:
  - $T$ tasks
  - each task is a probability measure, e.g. $\mu_t(x,y) = P(x)\,\delta(\langle w_t^*, x\rangle - y)$, where the Dirac delta $\delta$ plays the role of the conditional probability (the output is a deterministic function of the input) and $w_t^*$ is an underlying parameter vector
  - $\mathbb{R}^d$ can also be a Hilbert space
- Draw data: $(x_{t1}, y_{t1}),\dots,(x_{tn}, y_{tn}) \sim \mu_t$ for $t = 1,\dots,T$, where the $x_{ti}$ are vectors and the $y_{ti}$ are scalars (in practice $n$ may vary with $t$)
- Learning method:
  $$\min_{(f_1,\dots,f_T)\in\mathcal{F}}\ \frac{1}{T}\sum_{t=1}^T \frac{1}{n}\sum_{i=1}^n \ell\big(y_{ti}, f_t(x_{ti})\big)$$
  where $\mathcal{F}$ is a set of vector-valued functions. A standard choice is a ball in an RKHS, which models interactions between the tasks in the sense that functions with small norm have strongly related components.
- The goal is to minimise the multi-task error
  $$R(f_1,\dots,f_T) = \frac{1}{T}\sum_{t=1}^T \mathbb{E}_{(x,y)\sim\mu_t}\,\ell\big(y, f_t(x)\big)$$
  (a minimal sketch of the empirical objective follows this list)
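To make the objective concrete, here is a minimal sketch (my own illustration, not lecture code) that evaluates the empirical multi-task objective for linear predictors $f_t(x)=\langle w_t,x\rangle$ with the squared loss; the array shapes and the toy data are assumptions.

```python
import numpy as np

def multitask_empirical_risk(W, X, Y):
    """Empirical objective (1/T) sum_t (1/n) sum_i loss(y_ti, <w_t, x_ti>).

    W: (d, T) weight matrix, one column per task.
    X: (T, n, d) inputs; Y: (T, n) outputs. Squared loss assumed.
    """
    T = W.shape[1]
    total = 0.0
    for t in range(T):
        preds = X[t] @ W[:, t]                 # f_t(x_ti) = <w_t, x_ti>
        total += np.mean((Y[t] - preds) ** 2)  # (1/n) sum_i squared loss
    return total / T

# Toy usage: T related tasks whose true weight vectors share a common part.
rng = np.random.default_rng(0)
T, n, d = 5, 10, 20
w_common = rng.normal(size=d)
W_true = w_common[:, None] + 0.1 * rng.normal(size=(d, T))
X = rng.normal(size=(T, n, d))
Y = np.einsum('tnd,dt->tn', X, W_true) + 0.1 * rng.normal(size=(T, n))
print(multitask_empirical_risk(W_true, X, Y))
```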
Linear MTL
- "task" = "linear model":
  - regression: $y_{ti} = \langle w_t^*, x_{ti}\rangle + \epsilon_{ti}$
  - binary classification: $y_{ti} = \mathrm{sign}(\langle w_t^*, x_{ti}\rangle)\,\epsilon_{ti}$
- Learning method:
  $$\min_{(w_1,\dots,w_T)\in S}\ \frac{1}{T}\sum_{t=1}^T \frac{1}{n}\sum_{i=1}^n \ell\big(y_{ti}, \langle w_t, x_{ti}\rangle\big)$$
  Here $S$ incorporates prior knowledge about the regression vectors and encourages "common structure" among the tasks, e.g. the ball of a matrix norm or another regulariser (see the sketch after this list).
- The multi-task error of $W = [w_1,\dots,w_T]$ is
  $$R(W) = \frac{1}{T}\sum_{t=1}^T \mathbb{E}_{(x,y)\sim\mu_t}\,\ell\big(y, \langle w_t, x\rangle\big)$$
- It is possible to bound the uniform deviation
  $$\sup_{W\in S}\left\{R(W) - \frac{1}{T}\sum_{t=1}^T \frac{1}{n}\sum_{i=1}^n \ell\big(y_{ti}, \langle w_t, x_{ti}\rangle\big)\right\}$$
  and derive bounds on the excess error $R(\hat W) - \min_{W\in S} R(W)$.
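As an illustration of learning over a constraint set $S$, here is a hedged sketch with the squared loss where $S$ is, hypothetically, a Frobenius-norm ball (the lecture leaves $S$ generic); projected gradient descent alternates a gradient step with a projection back onto the ball.

```python
import numpy as np

def fit_linear_mtl(X, Y, radius=5.0, lr=0.05, steps=1000):
    """Projected gradient descent for the linear MTL objective with squared
    loss, constrained to the Frobenius ball S = {W : ||W||_F <= radius}.

    X: (T, n, d) inputs; Y: (T, n) outputs. Returns W of shape (d, T).
    """
    T, n, d = X.shape
    W = np.zeros((d, T))
    for _ in range(steps):
        # Gradient of (1/T) sum_t (1/n) ||Y_t - X_t w_t||^2, column by column.
        grad = np.stack(
            [(-2.0 / (n * T)) * X[t].T @ (Y[t] - X[t] @ W[:, t]) for t in range(T)],
            axis=1,
        )
        W = W - lr * grad
        fro = np.linalg.norm(W)
        if fro > radius:          # project back onto the ball S
            W = W * (radius / fro)
    return W
```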
Regularisers for linear MTL
Often we drop the constraint (i.e. we replace the constraint set $S$ by a regularisation term added to the objective).
Different regularisers encourage different types of commonalities between the tasks:
- Variance (or other convex quadratic regularisers) encourages closeness to the mean:
  $$\Omega_{\mathrm{var}}(W) = \frac{1}{T}\sum_{t=1}^T \|w_t\|^2 + \frac{1-\gamma}{\gamma}\,\mathrm{Var}(w_1,\dots,w_T)$$
- Joint sparsity (or other structured sparsity regularisers) encourages few shared variables:
  $$\|W\|_{2,1} := \sum_{j=1}^d \sqrt{\sum_{t=1}^T w_{tj}^2}$$
- Trace norm (or other spectral regularisers which promote low-rank solutions) encourages few shared features:
  $$\|[w_1,\dots,w_T]\|_{\mathrm{tr}}$$
  - an extension of joint sparsity obtained by allowing a rotation of the initial data representation
  - bounding the $\ell_1$ norm of the singular values of the matrix favours low-rank solutions (i.e. a common low-dimensional subspace)
- More sophisticated regularisers combine the above, promote clustering of tasks, etc. (the first three are computed in the sketch below)
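For concreteness, here is a minimal sketch (my own illustration; storing $W$ as a $(d, T)$ matrix with one column per task is my convention) computing the three regularisers above.

```python
import numpy as np

def variance_reg(W, gamma):
    """(1/T) sum_t ||w_t||^2 + ((1-gamma)/gamma) * Var(w_1, ..., w_T)."""
    T = W.shape[1]
    w_bar = W.mean(axis=1, keepdims=True)
    var = np.sum((W - w_bar) ** 2) / T
    return np.sum(W ** 2) / T + (1 - gamma) / gamma * var

def l21_norm(W):
    """Sum over variables j of the l2 norm of row j: few shared variables."""
    return np.sum(np.sqrt(np.sum(W ** 2, axis=1)))

def trace_norm(W):
    """Sum of singular values: favours a shared low-dimensional subspace."""
    return np.sum(np.linalg.svd(W, compute_uv=False))
```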
Quadratic regulariser
- General quadratic regulariser:
  $$\Omega(W) = \sum_{s,t=1}^T \langle w_s, E_{st} w_t\rangle$$
  where the matrix $E = (E_{st})_{s,t=1}^T \in \mathbb{R}^{dT\times dT}$ is positive definite.
- Variance regulariser: let $\gamma\in[0,1]$ and
  $$\Omega_{\mathrm{var}}(W) = \frac{1}{T}\sum_{t=1}^T \|w_t\|^2 + \frac{1-\gamma}{\gamma}\,\mathrm{Var}(w_1,\dots,w_T) = \frac{1}{T}\sum_{t=1}^T \|w_t\|^2 + \frac{1-\gamma}{\gamma T}\sum_{t=1}^T \|w_t - \bar w\|_2^2$$
  - $\gamma=1$: independent tasks; $\gamma=0$: identical tasks
  - the regulariser favours weight vectors which are close to their mean $\bar w$
  - for an SVM with the hinge loss, the objective function is a compromise between maximising the individual margins and minimising the variance (i.e. keeping the tasks close to each other)
- Link to kernel methods (quadratic regulariser): the problem
  $$\min_{w_1,\dots,w_T}\ \frac{1}{T}\sum_{t=1}^T \frac{1}{n}\sum_{i=1}^n \ell\big(y_{ti}, \langle w_t, x_{ti}\rangle\big) + \lambda \sum_{s,t=1}^T \langle w_s, E_{st} w_t\rangle$$
  is equivalent to
  $$\min_v\ \frac{1}{T}\sum_{t=1}^T \frac{1}{n}\sum_{i=1}^n \ell\big(y_{ti}, \langle v, B_t x_{ti}\rangle\big) + \lambda \langle v, v\rangle \qquad (1)$$
  where the $B_t$ are $p\times d$ matrices (typically $p\gg d$) linked to $E$ by $E = (B^\top B)^{-1}$, with $B = [B_1,\dots,B_T] \in \mathbb{R}^{p\times dT}$ the column-wise concatenation, and $w_t = B_t^\top v$.
  Interpretation:
  - we learn a single function $(x,t)\mapsto f_t(x)$ using the feature map $(x,t)\mapsto B_t x$ and the corresponding multitask kernel $K((x_1,t_1),(x_2,t_2)) = \langle B_{t_1}x_1, B_{t_2}x_2\rangle$
  - writing $\langle v, B_t x\rangle = \langle B_t^\top v, x\rangle$, we interpret this as a single regression vector $v$ which is transformed by the matrix $B_t$ to obtain the task-specific weight vector.
- Link to kernel methods (variance regulariser): the problem
  $$\min_{w_1,\dots,w_T}\ \frac{1}{Tn}\sum_{t,i} \ell\big(y_{ti}, \langle w_t, x_{ti}\rangle\big) + \lambda\left(\frac{1}{T}\sum_{t=1}^T \|w_t\|^2 + \frac{1-\gamma}{\gamma}\,\mathrm{Var}(w_1,\dots,w_T)\right)$$
  is equivalent to
  $$\min_{w_0,u_1,\dots,u_T}\ \frac{1}{Tn}\sum_{t,i} \ell\big(y_{ti}, \langle w_0 + u_t, x_{ti}\rangle\big) + \lambda\left(\frac{1}{\gamma T}\sum_{t=1}^T \|u_t\|^2 + \frac{1}{1-\gamma}\|w_0\|^2\right) \qquad (2)$$
  obtained by setting $w_t = w_0 + u_t$ and minimising over $w_0$.
  This is of the form (1) with $v = \big((1-\gamma)^{-1/2}w_0,\ (\gamma T)^{-1/2}u_1,\dots,(\gamma T)^{-1/2}u_T\big) \in \mathbb{R}^{(T+1)d}$ and
  $$B_t^\top = \Big[\sqrt{1-\gamma}\,I_{d\times d},\ \underbrace{0_{d\times d},\dots,0_{d\times d}}_{t-1},\ \sqrt{\gamma T}\,I_{d\times d},\ \underbrace{0_{d\times d},\dots,0_{d\times d}}_{T-t}\Big]$$
  with the corresponding kernel
  $$K((x_1,t_1),(x_2,t_2)) = \big(1-\gamma+\gamma T\,\delta_{t_1 t_2}\big)\,\langle x_1, x_2\rangle$$
  (used in the sketch after this list).
  Writing (2) as
  $$\min_{w_0}\ \frac{1}{T}\sum_{t=1}^T \min_w\left\{\frac{1}{n}\sum_{i=1}^n \ell\big(y_{ti}, \langle w, x_{ti}\rangle\big) + \frac{\lambda}{\gamma}\|w - w_0\|^2\right\} + \frac{\lambda}{1-\gamma}\|w_0\|^2$$
  makes it more apparent that we regularise each task around a common vector $w_0$.
- More multitask kernels can be constructed along the same lines.
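To see the multitask kernel in action, here is a hedged sketch of kernel ridge regression with $K((x_1,t_1),(x_2,t_2)) = (1-\gamma+\gamma T\,\delta_{t_1 t_2})\langle x_1, x_2\rangle$; the squared loss, the linear base kernel, and the stacked data layout are my assumptions.

```python
import numpy as np

def multitask_kernel_ridge(X, y, tasks, T, gamma, lam):
    """Kernel ridge regression with the variance-regulariser multitask kernel.

    X: (N, d) stacked inputs from all tasks; y: (N,) outputs;
    tasks: (N,) integer task index of each example.
    """
    N = len(y)
    base = X @ X.T                                # linear base kernel <x_i, x_j>
    same = tasks[:, None] == tasks[None, :]       # delta_{t_i t_j}
    K = (1 - gamma + gamma * T * same) * base
    alpha = np.linalg.solve(K + lam * N * np.eye(N), y)

    def predict(x_new, t_new):
        k = (1 - gamma + gamma * T * (tasks == t_new)) * (X @ x_new)
        return k @ alpha

    return predict
```

Note how $\gamma$ interpolates: at $\gamma=1$ the kernel is block-diagonal across tasks (independent learning), while at $\gamma=0$ it ignores the task index entirely (all tasks share one regression vector).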
Structured sparsity
- General sparsity regulariser:
  $$\|W\|_{2,1} := \sum_{j=1}^d \sqrt{\sum_{t=1}^T w_{tj}^2}$$
  - the sum of the $\ell_2$ norms of the rows of the matrix $W$
  - encourages matrices with only a few non-zero rows
  - the regression vectors are sparse, and their sparsity patterns are contained in a common set of variables of small cardinality (the associated proximal operator is sketched below)
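The $\ell_{2,1}$ regulariser is typically handled with proximal gradient methods; its proximal operator has the well-known closed form of row-wise group soft-thresholding. A sketch, using the same $(d, T)$ layout as before:

```python
import numpy as np

def prox_l21(W, tau):
    """argmin_V 0.5 * ||V - W||_F^2 + tau * ||V||_{2,1}.

    Shrinks each row of W toward zero and sets rows with norm <= tau exactly
    to zero, so the same small set of variables is kept for every task.
    """
    row_norms = np.linalg.norm(W, axis=1, keepdims=True)
    scale = np.maximum(0.0, 1.0 - tau / np.maximum(row_norms, 1e-12))
    return scale * W
```

Alternating this operator with gradient steps on the empirical risk gives a standard proximal-gradient solver for the joint-sparsity problem.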
Clustered MTL
Further topics
Transferring to new tasks
- Having found a feature map $h$, to test it on the environment $\mathcal{E}$ (a probability measure over tasks) we:
  1) draw a task $\mu \sim \mathcal{E}$
  2) draw a sample $z \sim \mu^n$
  3) run the algorithm to obtain $a(h)_z = \hat f_{h,z}\circ h$
  4) measure the loss of $a(h)_z$ on a random pair $(x,y)\sim\mu$
- The error associated with the algorithm $a(h)$ is
  $$R_n(h) = \mathbb{E}_{\mu\sim\mathcal{E}}\,\mathbb{E}_{z\sim\mu^n}\,\mathbb{E}_{(x,y)\sim\mu}\big[\ell(a(h)_z(x), y)\big]$$
- The best value for a representation $h$, given complete knowledge of the environment, is then $\min_{h\in\mathcal{H}} R_n(h)$. Compare this to the very best we can do:
  $$R^* = \min_{h\in\mathcal{H}} \mathbb{E}_{\mu\sim\mathcal{E}}\left[\min_{f\in\mathcal{F}} \mathbb{E}_{(x,y)\sim\mu}\,\ell\big(f(h(x)), y\big)\right]$$
- The excess error associated with $h$ is then $R_n(h) - R^*$ (estimated numerically in the sketch below).
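A hedged Monte Carlo sketch of $R_n(h)$: the Gaussian toy environment, the linear feature map $h(x)=Hx$, and ridge regression as the inner algorithm are all my assumptions, chosen only to make the nested expectations concrete.

```python
import numpy as np

def estimate_Rn(H, n=20, d=30, n_tasks=200, noise=0.1, lam=0.1, seed=0):
    """Monte Carlo estimate of R_n(h) for h(x) = H x with ridge regression.

    Toy environment E: each task's weight vector is a small perturbation of
    a common environment vector w_env.
    """
    rng = np.random.default_rng(seed)
    w_env = rng.normal(size=d)
    losses = []
    for _ in range(n_tasks):
        w_star = w_env + 0.1 * rng.normal(size=d)    # 1) draw a task mu ~ E
        X = rng.normal(size=(n, d))                   # 2) draw a sample z ~ mu^n
        y = X @ w_star + noise * rng.normal(size=n)
        Phi = X @ H.T                                 # 3) run ridge on h's features
        p = Phi.shape[1]
        v = np.linalg.solve(Phi.T @ Phi / n + lam * np.eye(p), Phi.T @ y / n)
        x_new = rng.normal(size=d)                    # 4) loss on a fresh (x, y) ~ mu
        y_new = x_new @ w_star + noise * rng.normal()
        losses.append((y_new - (H @ x_new) @ v) ** 2)
    return np.mean(losses)
```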
Case of the variance regulariser
- Training:
  $$\min_{w_0}\ \frac{1}{T}\sum_{t=1}^T \min_w\left\{\frac{1}{n}\sum_{i=1}^n \ell\big(y_{ti}, \langle w, x_{ti}\rangle\big) + \frac{\lambda}{\gamma}\|w - w_0\|^2\right\} + \frac{\lambda}{1-\gamma}\|w_0\|^2$$
- Testing (on a new task):
  $$\min_w\ \frac{1}{n}\sum_{i=1}^n \ell\big(y_i, \langle w, x_i\rangle\big) + \frac{\lambda}{\gamma}\|w - w_0\|^2$$
- Error:
  $$R_n(w_0) = \mathbb{E}_{\mu\sim\mathcal{E}}\,\mathbb{E}_{z\sim\mu^n}\,\mathbb{E}_{(x,y)\sim\mu}\,\ell\big(y, \langle w_0 + w_z, x\rangle\big)$$
- Best we can do:
  $$R^* = \min_{w_0} \mathbb{E}_{\mu\sim\mathcal{E}}\left[\min_w \mathbb{E}_{(x,y)\sim\mu}\,\ell\big(y, \langle w_0 + w, x\rangle\big)\right]$$
- Excess error of $w_0$: $R_n(w_0) - R^*$ (a closed-form sketch for the squared loss follows).
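For the squared loss, the test-time problem above has a closed form, so transferring to a new task is a single linear solve. A sketch (the closed form follows from setting the gradient to zero):

```python
import numpy as np

def fit_new_task(X, y, w0, lam, gamma):
    """Solve min_w (1/n) sum_i (y_i - <w, x_i>)^2 + (lam/gamma) ||w - w0||^2.

    Setting the gradient to zero gives
    (X^T X / n + (lam/gamma) I) w = X^T y / n + (lam/gamma) w0.
    """
    n, d = X.shape
    c = lam / gamma
    A = X.T @ X / n + c * np.eye(d)
    b = X.T @ y / n + c * w0
    return np.linalg.solve(A, b)
```

As $\lambda/\gamma \to \infty$ the solution collapses onto $w_0$ (identical tasks); as $\lambda/\gamma \to 0$ it ignores $w_0$ (independent learning), matching the roles of $\gamma$ described earlier.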
Informal reasoning
The feature map learned on the training tasks can be reused for a new task:
- Learn the new task by the method
  $$\min_v\left\{\frac{1}{n}\sum_{i=1}^n \ell\big(y_i, \langle v, Bx_i\rangle\big) + \frac{\lambda}{2}\|v\|_2^2\right\}$$
  where $B$ is the feature map found during multi-task training.
- This gives more weight to important features. In particular, if some eigenvalues of $G = B^\top B$ are zero, the corresponding eigenvectors are discarded when learning a new task (see the sketch below).
- In the case of diagonal matrices, some diagonal elements may be zero, which reduces the number of parameters to learn.
- A statistical justification of a similar approach, based on dictionary learning, can be given.
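A small sketch of the dimensionality reduction described above: drop the eigendirections of $G = B^\top B$ whose eigenvalues are (numerically) zero before fitting the new task. The tolerance is an assumption.

```python
import numpy as np

def reduced_feature_map(B, tol=1e-10):
    """Return a map onto the eigenvectors of G = B.T @ B with eigenvalue > tol.

    Directions with zero eigenvalue receive no weight under the learned
    representation, so discarding them leaves fewer parameters to learn.
    """
    G = B.T @ B
    evals, evecs = np.linalg.eigh(G)
    kept = evecs[:, evals > tol]            # (d, k) basis with k <= d
    return lambda X: X @ kept               # maps (n, d) data to (n, k)
```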
Take home message
- the MTL objective function
- regularisers encoding how tasks are related
- the link to the kernel trick
[1] Multi-task learning, Wikipedia: https://en.wikipedia.org/wiki/Multi-task_learning