Multi-task Learning


Based on Supervised Learning, Lecture 8

  • Multi-task learning
    • Mathematical formulation
  • Linear MTL
    • Regularisers for linear MTL
      • Quadratic regulariser
      • Structured sparsity
  • Clustered MTL
  • Further topics
    • Transferring to new tasks
      • Case of the variance regulariser
      • Informal reasoning
  • Take home message

Multi-task learning

  • Multi-task learning (MTL) is an approach to machine learning that learns a problem together with other related problems at the same time, using a shared representation.[1]
  • The goal of MTL is to improve the performance of learning algorithms by learning classifiers for multiple tasks jointly.
  • Typical scenario: many tasks but only a few examples per task. If $n < d$ we do not have enough data to learn the tasks one by one. However, if the tasks are related and the constraint set $S$ (introduced below) or the associated regulariser captures such relationships in a simple way, learning the tasks jointly greatly improves over independent task learning (ITL).
  • When problems (tasks) are closely related, learning in parallel can be more efficient than learning tasks independently. Also, this often leads to a better model for the main task, because it allows the learner to use the commonality among the tasks.
  • Applications: learning a set of linear classifiers for related objects (cars, lorries, bicycles), user modelling, multiple object detection in scenes, affective computing, bioinformatics, health informatics, marketing science, neuroimaging, NLP, speech…
  • Further categorisation is possible, e.g. hierarchical models, clustering of tasks.
  • The ideas can be extended to non-linear cases through RKHS.

Mathematical formulation

  • Fix probability measures $\mu_1,\dots,\mu_T$ on $\mathbb{R}^d\times\mathbb{R}$
    – $T$ tasks
    – Each task is a probability measure, e.g. $\mu_t(x,y) = P(x)\,\delta(\langle w_t, x\rangle - y)$, where the Dirac delta $\delta$ encodes a deterministic conditional probability ($y = \langle w_t, x\rangle$) and $w_t$ is an underlying parameter
    – $\mathbb{R}^d$ can also be replaced by a Hilbert space
  • Draw data: $(x_{t1}, y_{t1}),\dots,(x_{tn}, y_{tn}) \sim \mu_t$, $t = 1,\dots,T$, with $x_{ti}\in\mathbb{R}^d$ a vector and $y_{ti}\in\mathbb{R}$ a scalar (in practice $n$ may vary with $t$)
  • Learning method:

    $\min_{(f_1,\dots,f_T)\in\mathcal{F}}\ \frac{1}{T}\sum_{t=1}^T \frac{1}{n}\sum_{i=1}^n \ell(y_{ti}, f_t(x_{ti}))$


    where $\mathcal{F}$ is a set of vector-valued functions. A standard choice is a ball in an RKHS which models interactions between the tasks, in the sense that functions with small norm have strongly related components.
  • Goal is to minimise the multi-task error

    $\mathcal{R}(f_1,\dots,f_T) = \frac{1}{T}\sum_{t=1}^T \mathbb{E}_{(x,y)\sim\mu_t}\,\ell(y, f_t(x))$
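
As a concrete illustration of the objective above, here is a minimal numpy sketch (our own, not from the lecture) that evaluates the empirical multi-task objective for the squared loss; the function name `multitask_empirical_risk` and the toy data are illustrative assumptions.

```python
# Minimal sketch of the empirical multi-task objective
#   (1/T) * sum_t (1/n) * sum_i loss(y_ti, f_t(x_ti))
# using the squared loss.  All names and the toy data are illustrative.
import numpy as np

def multitask_empirical_risk(X, Y, predictors, loss=lambda y, p: (y - p) ** 2):
    """X, Y: lists of length T with X[t] of shape (n, d) and Y[t] of shape (n,).
    predictors: list of T callables f_t mapping an (n, d) array to (n,) predictions."""
    T = len(X)
    per_task = [np.mean(loss(Y[t], predictors[t](X[t]))) for t in range(T)]
    return np.mean(per_task)

# Toy usage: T = 3 linear tasks, n = 5 examples each, d = 4 dimensions.
rng = np.random.default_rng(0)
T, n, d = 3, 5, 4
W = rng.normal(size=(T, d))                               # one weight vector per task
X = [rng.normal(size=(n, d)) for _ in range(T)]
Y = [X[t] @ W[t] + 0.1 * rng.normal(size=n) for t in range(T)]
predictors = [lambda x, w=W[t]: x @ w for t in range(T)]
print(multitask_empirical_risk(X, Y, predictors))
```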

Linear MTL

  • “task” = “linear model”
    – Regression: $y_{ti} = \langle w_t, x_{ti}\rangle + \epsilon_{ti}$
    – Binary classification: $y_{ti} = \mathrm{sign}(\langle w_t, x_{ti}\rangle)\,\epsilon_{ti}$
  • Learning method: $\min_{(w_1,\dots,w_T)\in S}\ \frac{1}{T}\sum_{t=1}^T \frac{1}{n}\sum_{i=1}^n \ell(y_{ti}, \langle w_t, x_{ti}\rangle)$. Here $S$ incorporates the prior knowledge about the regression vectors and encourages "common structure" among the tasks, e.g. the ball of a matrix norm or of some other regulariser.
  • The multitask error of $W = [w_1,\dots,w_T]$ is: $\mathcal{R}(W) = \frac{1}{T}\sum_{t=1}^T \mathbb{E}_{(x,y)\sim\mu_t}\,\ell(y, \langle w_t, x\rangle)$
  • It is possible to give bounds on the uniform deviation
    $\sup_{W\in S}\Big\{\mathcal{R}(W) - \frac{1}{T}\sum_{t=1}^T \frac{1}{n}\sum_{i=1}^n \ell(y_{ti}, \langle w_t, x_{ti}\rangle)\Big\}$

    and derive bounds for the excess error
    $\mathcal{R}(\hat W) - \min_{W\in S} \mathcal{R}(W)$
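
The uniform deviation above compares $\mathcal{R}(W)$ with the empirical multi-task error of the same $W$. The following sketch (our own toy setup with squared loss and Gaussian linear regression tasks) estimates both quantities for one fixed $W$ by Monte Carlo; it does not compute the supremum over $S$, only the gap for a single candidate.

```python
# Toy illustration of the empirical vs. expected multi-task error of one fixed W
# (squared loss, Gaussian linear regression tasks).  The uniform-deviation bounds
# control the supremum of this gap over all W in S; here we only look at a single W.
import numpy as np

rng = np.random.default_rng(1)
T, n, d, sigma = 10, 20, 50, 0.1
W_true = rng.normal(size=(T, d))                # underlying task vectors w_t
W = W_true + 0.05 * rng.normal(size=(T, d))     # a fixed candidate W

def multitask_error(W, m):
    """Average squared-loss error over tasks, estimated from m fresh samples per task."""
    errs = []
    for t in range(T):
        X = rng.normal(size=(m, d))
        y = X @ W_true[t] + sigma * rng.normal(size=m)
        errs.append(np.mean((y - X @ W[t]) ** 2))
    return np.mean(errs)

empirical = multitask_error(W, n)           # empirical error on n samples per task
population = multitask_error(W, 100_000)    # Monte Carlo proxy for R(W)
print(empirical, population, population - empirical)
```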

Regularisers for linear MTL

Often we drop the constraint (i.e. $W\in S$) and consider the penalty method

$\min_{w_1,\dots,w_T}\ \frac{1}{T}\sum_{t=1}^T \frac{1}{n}\sum_{i=1}^n \ell(y_{ti}, \langle w_t, x_{ti}\rangle) + \lambda\,\Omega(w_1,\dots,w_T)$

Different regularisers encourage different types of commonalities between the tasks:

  • variance (or other convex quadratic regularisers) encourage closeness to the mean
    $\Omega_{\mathrm{var}} = \frac{1}{T}\sum_{t=1}^T \|w_t\|^2 + \frac{1-\gamma}{\gamma}\,\mathrm{Var}(w_1,\dots,w_T)$
  • Joint sparsity (or other structured sparsity regularisers) encourage few shared variables
    $\|W\|_{2,1} := \sum_{j=1}^d \sqrt{\sum_{t=1}^T w_{tj}^2}$
  • Trace norm (or other spectral regularisers which promote low-rank solutions) encourage few shared features
    $\|[w_1,\dots,w_T]\|_{\mathrm{tr}}$

    – extension of joint sparsity: it allows a rotation of the initial data representation
    – the trace norm is the $\ell_1$ norm of the singular values of the matrix, so bounding it favours low-rank solutions (i.e. a common low-dimensional subspace)
  • More sophisticated regularisers which combine the above, promote clustering of tasks, etc.
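
A short numpy sketch (our own illustration) evaluating the three regularisers on weight matrices with the corresponding structure: columns close to their mean, few non-zero rows, and low rank. All names and the toy matrices are ours.

```python
# Toy comparison of the three regularisers on W = [w_1, ..., w_T], stored as a d x T
# matrix with one column per task.  The matrices below have exactly the structure each
# regulariser favours: columns close to their mean, few non-zero rows, low rank.
import numpy as np

def variance_reg(W, gamma):
    """(1/T) sum_t ||w_t||^2 + ((1 - gamma)/gamma) * Var(w_1, ..., w_T)."""
    T = W.shape[1]
    w_bar = W.mean(axis=1, keepdims=True)
    var = np.sum((W - w_bar) ** 2) / T
    return np.sum(W ** 2) / T + (1 - gamma) / gamma * var

def l21_norm(W):
    """Sum over variables (rows) of the l2 norm across tasks."""
    return np.sum(np.sqrt(np.sum(W ** 2, axis=1)))

def trace_norm(W):
    """l1 norm of the singular values of W."""
    return np.sum(np.linalg.svd(W, compute_uv=False))

rng = np.random.default_rng(2)
d, T = 30, 10
W_close = np.tile(rng.normal(size=(d, 1)), (1, T)) + 0.01 * rng.normal(size=(d, T))
W_rowsparse = np.zeros((d, T)); W_rowsparse[:3] = rng.normal(size=(3, T))
W_lowrank = rng.normal(size=(d, 2)) @ rng.normal(size=(2, T))
for name, W in [("close to mean", W_close), ("row sparse", W_rowsparse), ("low rank", W_lowrank)]:
    print(f"{name:>13}: var={variance_reg(W, 0.5):7.2f}  l21={l21_norm(W):7.2f}  tr={trace_norm(W):7.2f}")
```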

Quadratic regulariser

  • general quadratic regulariser
    $\Omega(w_1,\dots,w_T) = \sum_{s,t=1}^T \langle w_s, E_{st} w_t\rangle$

    where the matrix $E = (E_{st})_{s,t=1}^T \in \mathbb{R}^{dT\times dT}$ (with blocks $E_{st}\in\mathbb{R}^{d\times d}$) is positive definite.
  • variance regulariser
    Let $\gamma\in[0,1]$ (endpoints understood as limits) and
    $\Omega_{\mathrm{var}} = \frac{1}{T}\sum_{t=1}^T \|w_t\|^2 + \frac{1-\gamma}{\gamma}\,\mathrm{Var}(w_1,\dots,w_T) = \frac{1}{T}\sum_{t=1}^T \|w_t\|^2 + \frac{1-\gamma}{\gamma T}\sum_{t=1}^T \|w_t - \bar w\|_2^2$, where $\bar w = \frac{1}{T}\sum_{t=1}^T w_t$

    – $\gamma = 1$: independent tasks; $\gamma = 0$: identical tasks
    – the regulariser favours weight vectors which are close to their mean
    – with an SVM and the hinge loss, the objective function is a compromise between maximising the individual margins and minimising the variance (i.e. keeping the task weight vectors close to each other)
  • Link to the kernel methods (quadratic regulariser)
    The problem
    $\min_{w_1,\dots,w_T}\ \frac{1}{T}\sum_{t=1}^T \frac{1}{n}\sum_{i=1}^n \ell(y_{ti}, \langle w_t, x_{ti}\rangle) + \lambda \sum_{s,t=1}^T \langle w_s, E_{st} w_t\rangle$

    is equivalent to

    $\min_{v}\ \frac{1}{T}\sum_{t=1}^T \frac{1}{n}\sum_{i=1}^n \ell(y_{ti}, \langle v, B_t x_{ti}\rangle) + \lambda \langle v, v\rangle \qquad (1)$


    where the $B_t$ are $p\times d$ matrices (typically $p \geq d$) linked to $E$ by $E = (B^\top B)^{-1}$, with $B = [B_1,\dots,B_T] \in \mathbb{R}^{p\times dT}$ the column-wise concatenation, and $w_t = B_t^\top v$.
    Interpretation:
    – We learn a single function $(x,t) \mapsto f_t(x)$ using the feature map $(x,t) \mapsto B_t x$ and the corresponding multitask kernel $K((x_1,t_1),(x_2,t_2)) = \langle B_{t_1} x_1, B_{t_2} x_2\rangle$
    – Writing $\langle v, B_t x\rangle = \langle B_t^\top v, x\rangle$, we interpret this as having a single regression vector $v$ which is transformed by the matrix $B_t$ to obtain the task-specific weight vector.
  • Link to the kernel methods (variance regulariser)
    The problem
    $\min_{w_1,\dots,w_T}\ \frac{1}{Tn}\sum_{t,i} \ell(y_{ti}, \langle w_t, x_{ti}\rangle) + \lambda\Big(\frac{1}{T}\sum_{t=1}^T \|w_t\|^2 + \frac{1-\gamma}{\gamma}\,\mathrm{Var}(w_1,\dots,w_T)\Big)$

    is equivalent to
    $\min_{w_0, u_1,\dots,u_T}\ \frac{1}{Tn}\sum_{t,i} \ell(y_{ti}, \langle w_0 + u_t, x_{ti}\rangle) + \lambda\Big(\frac{1}{\gamma T}\sum_{t=1}^T \|u_t\|^2 + \frac{1}{1-\gamma}\|w_0\|^2\Big) \qquad (2)$

    by setting $w_t = w_0 + u_t$ and minimising over $w_0$.
    It is of the form (1) with
    $v = \big((1-\gamma)^{-\frac12} w_0,\ (\gamma T)^{-\frac12} u_1,\ \dots,\ (\gamma T)^{-\frac12} u_T\big) \in \mathbb{R}^{(T+1)d}$,
    $B_t^\top = \big[\,\sqrt{1-\gamma}\, I_{d\times d},\ \underbrace{0_{d\times d},\dots,0_{d\times d}}_{t-1},\ \sqrt{\gamma T}\, I_{d\times d},\ \underbrace{0_{d\times d},\dots,0_{d\times d}}_{T-t}\,\big]$, so that $B_t^\top v = w_0 + u_t = w_t$

    and the corresponding kernel $K((x_1,t_1),(x_2,t_2)) = (1-\gamma + \gamma T\,\delta_{t_1 t_2})\,\langle x_1, x_2\rangle$
    By writing (2) as the following, it is more apparent that we regularise around some common vector w0
    $\min_{w_0}\ \frac{1}{T}\sum_{t=1}^T \min_{w}\Big\{\frac{1}{n}\sum_{i=1}^n \ell(y_{ti}, \langle w, x_{ti}\rangle) + \frac{\lambda}{\gamma}\|w - w_0\|^2\Big\} + \frac{\lambda}{1-\gamma}\|w_0\|^2$
  • More multitask kernels
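
The following sketch (our own numerical check, not from the lecture) builds the block matrices $B_t$ described above and verifies that $\langle B_{t_1}x_1, B_{t_2}x_2\rangle$ equals the closed-form multitask kernel $(1-\gamma+\gamma T\,\delta_{t_1 t_2})\langle x_1, x_2\rangle$.

```python
# Numerical check that the block feature maps B_t above reproduce the multitask kernel
#   K((x1, t1), (x2, t2)) = (1 - gamma + gamma * T * delta_{t1 t2}) * <x1, x2>.
import numpy as np

rng = np.random.default_rng(3)
d, T, gamma = 4, 5, 0.3
I = np.eye(d)

def B(t):
    """(T+1)d x d matrix: sqrt(1-gamma) I on top, sqrt(gamma T) I in block t+1, zeros elsewhere."""
    blocks = [np.sqrt(1 - gamma) * I] + [np.zeros((d, d))] * T
    blocks[t + 1] = np.sqrt(gamma * T) * I
    return np.vstack(blocks)

x1, x2 = rng.normal(size=d), rng.normal(size=d)
for t1 in range(T):
    for t2 in range(T):
        lhs = (B(t1) @ x1) @ (B(t2) @ x2)
        rhs = (1 - gamma + gamma * T * (t1 == t2)) * (x1 @ x2)
        assert np.isclose(lhs, rhs)
print("kernel identity verified")
```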

Structured sparsity

  • general sparsity regulariser
    $\|W\|_{2,1} := \sum_{j=1}^d \sqrt{\sum_{t=1}^T w_{tj}^2}$

    – the sum over the rows of $W$ (one row per variable $j$) of their $\ell_2$ norms across tasks
    – encourages a matrix with only a few non-zero rows
    – the regression vectors are sparse and share a sparsity pattern of small cardinality (structured sparsity)
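
The notes do not discuss how to optimise with this penalty; purely as an illustration of why it zeroes out entire rows, here is the proximal operator of $\tau\|\cdot\|_{2,1}$ (row-wise group soft-thresholding), a standard building block for such problems. The function name and the toy matrix are our own assumptions.

```python
# Illustration only: the proximal operator of tau * ||.||_{2,1} shrinks the l2 norm of
# every row of W and sets rows with small norm exactly to zero, which is how the penalty
# produces matrices with only a few non-zero rows (shared variables).
import numpy as np

def prox_l21(W, tau):
    """Row-wise group soft-thresholding of a d x T matrix W."""
    row_norms = np.linalg.norm(W, axis=1, keepdims=True)
    scale = np.maximum(1.0 - tau / np.maximum(row_norms, 1e-12), 0.0)
    return scale * W

rng = np.random.default_rng(4)
W = rng.normal(size=(6, 3)) * np.array([[2.0], [2.0], [0.1], [0.1], [0.1], [0.1]])
print(prox_l21(W, tau=0.5))   # small rows are typically zeroed, large rows only shrunk
```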

Clustered MTL

Further topics

Transferring to new tasks

  • Having found a feature map $h$, to test it on the environment $\mathcal{E}$ we
    1) draw a task $\mu \sim \mathcal{E}$
    2) draw a sample $z \sim \mu^n$
    3) run the algorithm to obtain $a(h)_z = \hat f_{h,z}\circ h$
    4) measure the loss of $a(h)_z$ on a random pair $(x,y)\sim\mu$
  • The error associated with the algorithm a(h) is
    $\mathcal{R}_n(h) = \mathbb{E}_{\mu\sim\mathcal{E}}\,\mathbb{E}_{z\sim\mu^n}\,\mathbb{E}_{(x,y)\sim\mu}\big[\ell(a(h)_z(x), y)\big]$ (estimated by Monte Carlo in the sketch after this list)
  • The best value for a representation h given complete knowledge of the environment is then
    $\min_{h\in\mathcal{H}} \mathcal{R}_n(h)$
  • Compare to the very best we can do:

    $\mathcal{R}^* = \min_{h\in\mathcal{H}} \mathbb{E}_{\mu\sim\mathcal{E}}\Big[\min_{f\in\mathcal{F}} \mathbb{E}_{(x,y)\sim\mu}\,\ell(f(h(x)), y)\Big]$

  • The excess error associated with $h$ is then $\mathcal{R}_n(h) - \mathcal{R}^*$
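
A toy Monte Carlo sketch (our own, with a Gaussian linear environment and ridge regression as the algorithm $a(h)$) of the four-step protocol and of estimating $\mathcal{R}_n(h)$ for a fixed linear representation $h$. The environment, the constants and all function names are assumptions made for illustration.

```python
# Toy Monte Carlo sketch of the four-step protocol, with a Gaussian linear environment
# and ridge regression on h(x) = H x as the algorithm a(h).  Averaging the test losses
# over many draws estimates R_n(h).  All constants and names are illustrative.
import numpy as np

rng = np.random.default_rng(5)
d, n, lam, sigma = 20, 10, 0.1, 0.1
w_env = rng.normal(size=d)                       # tasks are scattered around w_env

def draw_task():
    return w_env + 0.1 * rng.normal(size=d)      # a task = its regression vector

def algorithm(H, X, y):
    """Ridge regression on the transformed inputs H x."""
    Z = X @ H.T
    return np.linalg.solve(Z.T @ Z / n + lam * np.eye(Z.shape[1]), Z.T @ y / n)

def estimate_Rn(H, trials=2000):
    losses = []
    for _ in range(trials):
        w = draw_task()                                                   # 1) draw mu ~ E
        X = rng.normal(size=(n, d))
        y = X @ w + sigma * rng.normal(size=n)                            # 2) draw z ~ mu^n
        v = algorithm(H, X, y)                                            # 3) run a(h) on z
        x_new = rng.normal(size=d)
        y_new = x_new @ w + sigma * rng.normal()                          # 4) fresh (x, y) ~ mu
        losses.append((y_new - v @ (H @ x_new)) ** 2)
    return np.mean(losses)

print(estimate_Rn(np.eye(d)))     # R_n(h) for h = identity representation
```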

Case of the variance regulariser

  • Training
    $\min_{w_0}\ \frac{1}{T}\sum_{t=1}^T \min_{w}\Big\{\frac{1}{n}\sum_{i=1}^n \ell(y_{ti}, \langle w, x_{ti}\rangle) + \frac{\lambda}{\gamma}\|w - w_0\|^2\Big\} + \frac{\lambda}{1-\gamma}\|w_0\|^2$
  • Testing
    $\min_{w}\ \frac{1}{n}\sum_{i=1}^n \ell(y_i, \langle w, x_i\rangle) + \frac{\lambda}{\gamma}\|w - w_0\|^2$ (see the closed-form sketch after this list)
  • Error
    $\mathcal{R}_n(w_0) = \mathbb{E}_{\mu\sim\mathcal{E}}\,\mathbb{E}_{z\sim\mu^n}\,\mathbb{E}_{(x,y)\sim\mu}\,\ell(y, \langle w_0 + w_z, x\rangle)$, where $w_0 + w_z$ denotes the solution of the testing problem on the sample $z$
  • Best we can do
    $\mathcal{R}^* = \min_{w_0} \mathbb{E}_{\mu\sim\mathcal{E}}\Big[\min_{w} \mathbb{E}_{(x,y)\sim\mu}\,\ell(y, \langle w_0 + w, x\rangle)\Big]$
  • Excess error of $w_0$: $\mathcal{R}_n(w_0) - \mathcal{R}^*$
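
For the squared loss the testing step has a closed form, which makes the transfer mechanism concrete: the new task is regularised towards the common vector $w_0$. A minimal sketch (our own; `transfer_ridge` and the toy task distribution are assumptions) comparing transfer around $w_0$ with plain ridge regression on the same few examples:

```python
# Transfer with the variance regulariser (squared loss): on a new task we solve
#   min_w (1/n) sum_i (y_i - <w, x_i>)^2 + (lam/gamma) ||w - w_0||^2,
# whose closed form is (X'X/n + (lam/gamma) I)^{-1} (X'y/n + (lam/gamma) w_0).
# The common vector w_0, the toy task and all names are illustrative.
import numpy as np

def transfer_ridge(X, y, w0, lam, gamma):
    n, d = X.shape
    A = X.T @ X / n + (lam / gamma) * np.eye(d)
    b = X.T @ y / n + (lam / gamma) * w0
    return np.linalg.solve(A, b)

rng = np.random.default_rng(6)
d, n, lam, gamma = 30, 10, 0.1, 0.5
w0 = rng.normal(size=d)                           # common vector found at training time
w_new = w0 + 0.1 * rng.normal(size=d)             # a new task close to w_0
X = rng.normal(size=(n, d))
y = X @ w_new + 0.1 * rng.normal(size=n)
w_hat = transfer_ridge(X, y, w0, lam, gamma)                   # regularise towards w_0
w_ind = transfer_ridge(X, y, np.zeros(d), lam, gamma)          # plain ridge (ignores w_0)
print(np.linalg.norm(w_hat - w_new), np.linalg.norm(w_ind - w_new))
```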

Informal reasoning

The feature map B learned from the training tasks can be used to learn a new task more quickly (a kind of bias learning heuristic).

  • Learn a new task by the method
    $\min_{v}\Big\{\frac{1}{n}\sum_{i=1}^n \ell(y_i, \langle v, B x_i\rangle) + \frac{\lambda}{2}\|v\|_2^2\Big\}$ (a closed-form sketch for the squared loss follows this list)
  • Give more weight to important features. In particular, if some eigenvalues of $G = B^\top B$ are zero, the corresponding eigenvectors are discarded when learning a new task.
  • In the case of diagonal matrices, some elements may be zero which results in a decreased number of parameters to learn.
  • A statistical justification of an approach similar to this based on dictionary learning can be given.
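
A minimal sketch (our own, squared loss) of the method above: ridge regression on the transformed inputs $Bx_i$, where input directions in the null space of $G = B^\top B$ are discarded automatically because $B$ maps them to zero. The map $B$ and the toy task are assumptions.

```python
# Learning a new task with a learned feature map B (squared loss):
#   min_v (1/n) sum_i (y_i - <v, B x_i>)^2 + (lam/2) ||v||^2.
# Directions of the input space in the null space of G = B^T B are discarded,
# since B maps them to zero.  B and the toy task are illustrative.
import numpy as np

def learn_new_task(B, X, y, lam):
    Z = X @ B.T                                   # rows are the features B x_i
    n, p = Z.shape
    return np.linalg.solve(Z.T @ Z / n + (lam / 2) * np.eye(p), Z.T @ y / n)

rng = np.random.default_rng(7)
d, p, n, lam = 20, 3, 15, 0.01
B = rng.normal(size=(p, d))                       # learned map keeping a 3-dimensional subspace
w_new = B.T @ rng.normal(size=p)                  # a new task living in that subspace
X = rng.normal(size=(n, d))
y = X @ w_new + 0.05 * rng.normal(size=n)
v = learn_new_task(B, X, y, lam)
print(np.linalg.norm(B.T @ v - w_new))            # the effective vector B^T v recovers w_new
```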

Take home message

  • MTL objective function
  • regulariser
  • link to kernel trick


  1. Multi-task learning, Wikipedia
    https://en.wikipedia.org/wiki/Multi-task_learning