Multi-task Learning

Source: Internet  Editor: 程序博客网  Time: 2024/06/06 17:31

Based on Supervised Learning, Lecture 8

  • Multi-task learning
    • Mathematical formulation
  • Linear MTL
    • Regularisers for linear MTL
      • Quadratic regulariser
      • Structured sparsity
  • Clustered MTL
  • Further topics
    • Transferring to new tasks
      • Case of the variance regulariser
      • Informal reasoning
  • Take home message

Multi-task learning

  • Multi-task learning (MTL) is an approach to machine learning that learns a problem together with other related problems at the same time, using a shared representation. 1
  • The goal of MTL is to improve the performance of learning algorithms by learning classifiers for multiple tasks jointly.
  • Typical scenario: many tasks but only a few examples per task. If n < d (fewer examples per task than input dimensions), we do not have enough data to learn the tasks one by one. However, if the tasks are related and the constraint set S or the associated regulariser captures such relationships in a simple way, learning the tasks jointly can improve substantially over independent task learning (ITL).
  • When problems (tasks) are closely related, learning in parallel can be more efficient than learning tasks independently. Also, this often leads to a better model for the main task, because it allows the learner to use the commonality among the tasks.
  • Applications: learning a set of linear classifiers for related objects
    (cars, lorries, bicycles), user modelling, multiple object detection in scenes, affective computing, bioinformatics, health informatics, marketing science, neuroimaging, NLP, speech…
  • Further categorisation is possible, e.g. hierarchical models, clustering of tasks.
  • The ideas can be extended to non-linear cases through RKHS.

Mathematical formulation

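  • Fix probability measures $\mu_1, \dots, \mu_T$ on $\mathbb{R}^d \times \mathbb{R}$
    – T tasks
    – Each task is a probability measure, e.g. $\mu_t(x, y) = P(x)\,\delta(\langle w, x\rangle - y)$. Here δ makes the label a deterministic function of x (it plays the role of the conditional probability of y given x), and w is an underlying parameter
    – $\mathbb{R}^d$ can also be a Hilbert space
  • Draw data: $(x_{t1}, y_{t1}), \dots, (x_{tn}, y_{tn}) \sim \mu_t$, $t = 1, \dots, T$, where each $x_{ti}$ is a vector and each $y_{ti}$ is a scalar (in practice n may vary with t)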
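  • Learning method:
    $$\min_{(f_1,\dots,f_T) \in \mathcal{F}} \frac{1}{T}\sum_{t=1}^{T}\frac{1}{n}\sum_{i=1}^{n} \ell\big(y_{ti}, f_t(x_{ti})\big)$$
    where $\mathcal{F}$ is a set of vector-valued functions. A standard choice is a ball in an RKHS, which models interactions between the tasks in the sense that functions with small norm have strongly related components.
  • Goal is to minimise the multi-task error
    $$R(f_1,\dots,f_T) = \frac{1}{T}\sum_{t=1}^{T} \mathbb{E}_{(x,y)\sim\mu_t}\, \ell\big(y, f_t(x)\big)$$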

Linear MTL

  • “task” = “linear model”
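    – Regression: $y_{ti} = \langle w_t, x_{ti}\rangle + \epsilon_{ti}$
    – Binary classification: $y_{ti} = \mathrm{sign}(\langle w_t, x_{ti}\rangle)\,\epsilon_{ti}$
  • Learning method: $\min_{(w_1,\dots,w_T) \in S} \frac{1}{T}\sum_{t=1}^{T}\frac{1}{n}\sum_{i=1}^{n} \ell(y_{ti}, \langle w_t, x_{ti}\rangle)$. Here, S incorporates the prior knowledge about the regression vectors and encourages “common structure” among tasks, e.g. the ball of a matrix norm or other regulariser.
  • The multi-task error of $W = [w_1, \dots, w_T]$ is: $R(W) = \frac{1}{T}\sum_{t=1}^{T} \mathbb{E}_{(x,y)\sim\mu_t}\, \ell(y, \langle w_t, x\rangle)$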
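  • It is possible to give bounds on the uniform deviation
    $$\sup_{W \in S}\Big\{ R(W) - \frac{1}{T}\sum_{t=1}^{T}\frac{1}{n}\sum_{i=1}^{n} \ell(y_{ti}, \langle w_t, x_{ti}\rangle) \Big\}$$
    and derive bounds for the excess error
    $$R(\hat{W}) - \min_{W \in S} R(W)$$
    where $\hat{W}$ denotes a minimiser of the empirical problem above.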
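
The empirical objective of linear MTL can be sketched in a few lines. This is a minimal illustration under assumptions that are not in the notes: the loss is taken to be the square loss, the toy data are noiseless, and the helper names (`dot`, `square_loss`, `empirical_mtl_error`) are made up for the example.

```python
def dot(u, v):
    """Inner product <u, v> of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def square_loss(y, prediction):
    """Square loss, one possible choice for the loss function."""
    return (y - prediction) ** 2


def empirical_mtl_error(W, tasks, loss=square_loss):
    """Average over tasks of the average loss on each task's sample.

    W     : list of T weight vectors w_t
    tasks : list of T datasets, each a list of (x, y) pairs
    """
    total = 0.0
    for w_t, data in zip(W, tasks):
        total += sum(loss(y, dot(w_t, x)) for x, y in data) / len(data)
    return total / len(W)


# Toy data (assumed): two related tasks in d = 2 with labels y = <w_t, x>.
true_W = [[1.0, 0.0], [1.0, 0.5]]
inputs = [[1.0, 2.0], [0.0, 1.0], [3.0, -1.0]]
tasks = [[(x, dot(w, x)) for x in inputs] for w in true_W]

error_at_truth = empirical_mtl_error(true_W, tasks)
error_at_zero = empirical_mtl_error([[0.0, 0.0], [0.0, 0.0]], tasks)
```

The true weight vectors give zero empirical error on this noiseless data, while the all-zero matrix does not. A constraint set S or a penalty would be added on top of this function when minimising over W.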

Regularisers for linear MTL

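Often we drop the constraint (i.e. $W \in S$) and consider penalty methods, with a regulariser Ω and a parameter λ > 0:
$$\min_{W} \frac{1}{T}\sum_{t=1}^{T}\frac{1}{n}\sum_{i=1}^{n} \ell(y_{ti}, \langle w_t, x_{ti}\rangle) + \lambda\,\Omega(W)$$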

Different regularisers encourage different types of commonalities between the tasks:

  • Variance (or other convex quadratic regularisers) encourages the task vectors to stay close to their mean
  • Joint sparsity (or other structured sparsity regularisers) encourages few shared variables
  • Trace norm (or other spectral regularisers which promote low-rank solutions) encourages few shared features
    – extension of joint sparsity: the initial data representation is rotated
    – the l1 norm of the singular values of W is bounded, which favours a low-rank representation (i.e. a common low-dimensional subspace)
  • More sophisticated regularisers combine the above, promote clustering of tasks, etc.

Quadratic regulariser

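  • general quadratic regulariser
    $$\Omega(W) = \sum_{s,t=1}^{T} \langle w_s, E_{st} w_t\rangle$$
    where the matrix $E = (E_{st})_{s,t=1}^{T} \in \mathbb{R}^{dT \times dT}$ is positive definite.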
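  • variance regulariser
    Let $\gamma \in [0,1]$ and
    $$\Omega_{\mathrm{VAR}}(W) = \frac{1}{T}\sum_{t=1}^{T}\|w_t\|^2 + \frac{1-\gamma}{\gamma}\,\mathrm{Var}(w_1,\dots,w_T), \qquad \mathrm{Var}(w_1,\dots,w_T) = \frac{1}{T}\sum_{t=1}^{T}\|w_t - \bar{w}\|^2, \quad \bar{w} = \frac{1}{T}\sum_{t=1}^{T} w_t$$
    γ=1: independent tasks; γ=0 (taken as a limit): identical tasks
    – the regulariser favours weight vectors which are close to their mean.
    – If we are working with an SVM with hinge loss, the objective function is a compromise between maximising the individual margins and minimising the variance (i.e. keeping the tasks close to each other)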
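  • Link to the kernel methods (quadratic regulariser)
    The problem
    $$\min_{w_1,\dots,w_T} \frac{1}{T}\sum_{t=1}^{T}\frac{1}{n}\sum_{i=1}^{n} \ell(y_{ti}, \langle w_t, x_{ti}\rangle) + \lambda \sum_{s,t=1}^{T} \langle w_s, E_{st} w_t\rangle$$
    is equivalent to
    $$\min_{v \in \mathbb{R}^p} \frac{1}{T}\sum_{t=1}^{T}\frac{1}{n}\sum_{i=1}^{n} \ell(y_{ti}, \langle v, B_t x_{ti}\rangle) + \lambda \langle v, v\rangle \qquad (1)$$
    where $B_t$ are $p \times d$ matrices (typically $p \ge d$) linked to E by $E = (B^\top B)^{-1}$, with $B = [B_1, \dots, B_T]$ the $p \times dT$ matrix obtained by concatenating the $B_t$ by columns, and $w_t = B_t^\top v$
    – We learn a single function $(x,t) \mapsto f_t(x)$ using the feature map $(x,t) \mapsto B_t x$ and the corresponding multitask kernel $K((x_1,t_1),(x_2,t_2)) = \langle B_{t_1} x_1, B_{t_2} x_2\rangle$
    – Writing $\langle v, B_t x\rangle = \langle B_t^\top v, x\rangle$, we interpret this as having a single regression vector v which is transformed by the matrix $B_t^\top$ to obtain the task-specific weight vector $w_t$.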
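  • Link to the kernel methods (variance regulariser)
    The problem
    $$\min_{w_1,\dots,w_T} \frac{1}{T}\sum_{t=1}^{T}\frac{1}{n}\sum_{i=1}^{n} \ell(y_{ti}, \langle w_t, x_{ti}\rangle) + \lambda\Big(\frac{1}{T}\sum_{t=1}^{T}\|w_t\|^2 + \frac{1-\gamma}{\gamma}\,\mathrm{Var}(w_1,\dots,w_T)\Big)$$
    (with Var the empirical variance of the weight vectors around their mean) is equivalent to
    $$\min_{w_0,u_1,\dots,u_T} \frac{1}{T}\sum_{t=1}^{T}\frac{1}{n}\sum_{i=1}^{n} \ell(y_{ti}, \langle w_0 + u_t, x_{ti}\rangle) + \lambda\Big(\frac{1}{\gamma T}\sum_{t=1}^{T}\|u_t\|^2 + \frac{1}{1-\gamma}\|w_0\|^2\Big) \qquad (2)$$
    by setting $w_t = w_0 + u_t$ and minimising over $w_0$.
    It is of the form (1) with
    $$B_t^\top = \big[\sqrt{1-\gamma}\, I_{d\times d},\; 0_{d\times d}, \dots, 0_{d\times d},\; \sqrt{\gamma T}\, I_{d\times d},\; 0_{d\times d}, \dots, 0_{d\times d}\big]$$
    (the block $\sqrt{\gamma T}\, I_{d\times d}$ sits in position t+1) and the corresponding kernel $K((x_1,t_1),(x_2,t_2)) = (1 - \gamma + \gamma T \delta_{t_1 t_2}) \langle x_1, x_2\rangle$
    By writing (2) as
    $$\min_{w_0,w_1,\dots,w_T} \frac{1}{T}\sum_{t=1}^{T}\frac{1}{n}\sum_{i=1}^{n} \ell(y_{ti}, \langle w_t, x_{ti}\rangle) + \lambda\Big(\frac{1}{\gamma T}\sum_{t=1}^{T}\|w_t - w_0\|^2 + \frac{1}{1-\gamma}\|w_0\|^2\Big)$$
    it is more apparent that we regularise around some common vector $w_0$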
  • More multitask kernels
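
The link between the variance regulariser and its kernel can be checked numerically. The sketch below is an illustration under assumptions: it takes the regulariser to be (1/T) Σ_t ||w_t||² + ((1−γ)/γ) Var(w_1,…,w_T) and the feature map B_t x to be a shared block √(1−γ)·x followed by T task blocks of which only block t is non-zero and equals √(γT)·x. Both forms are reconstructed from the kernel (1 − γ + γT δ_{t1 t2})⟨x1, x2⟩ quoted above, and all function names are hypothetical.

```python
import math


def dot(u, v):
    """Inner product <u, v> of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def variance_regulariser(W, gamma):
    """(1/T) sum_t ||w_t||^2 + ((1 - gamma) / gamma) * Var(w_1, ..., w_T).

    Assumed form; requires gamma > 0.
    """
    T, d = len(W), len(W[0])
    mean = [sum(w[j] for w in W) / T for j in range(d)]
    avg_sq_norm = sum(dot(w, w) for w in W) / T
    var = 0.0
    for w in W:
        diff = [a - m for a, m in zip(w, mean)]
        var += dot(diff, diff) / T
    return avg_sq_norm + (1.0 - gamma) / gamma * var


def feature_map(x, t, T, gamma):
    """Assumed B_t x: shared block, then T task blocks (only block t non-zero)."""
    out = [math.sqrt(1.0 - gamma) * xi for xi in x]
    for s in range(T):
        scale = math.sqrt(gamma * T) if s == t else 0.0
        out.extend(scale * xi for xi in x)
    return out


def multitask_kernel(x1, t1, x2, t2, T, gamma):
    """Closed form (1 - gamma + gamma * T * delta_{t1 t2}) <x1, x2>."""
    delta = 1.0 if t1 == t2 else 0.0
    return (1.0 - gamma + gamma * T * delta) * dot(x1, x2)


# Example values (assumed): T = 3 tasks, gamma = 0.25, tasks indexed from 0.
x1, x2 = [1.0, 2.0], [0.5, -1.0]
same_task = multitask_kernel(x1, 0, x2, 0, 3, 0.25)
other_task = multitask_kernel(x1, 0, x2, 2, 3, 0.25)
explicit_same = dot(feature_map(x1, 0, 3, 0.25), feature_map(x2, 0, 3, 0.25))
explicit_other = dot(feature_map(x1, 0, 3, 0.25), feature_map(x2, 2, 3, 0.25))
```

Points from the same task share both the common block and the task block, so their kernel value is larger in magnitude than for points from different tasks, which only interact through the common block.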

Structured sparsity

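  • general sparsity regulariser
    $$\Omega(W) = \sum_{j=1}^{d} \sqrt{\sum_{t=1}^{T} w_{jt}^2}, \qquad W = [w_1, \dots, w_T] \in \mathbb{R}^{d \times T}$$
    – sum of the l2 norms of the rows of the matrix W
    – encourages a matrix with only a few non-zero rows
    – the regression vectors are sparse, and their sparsity patterns are contained in a common set of small cardinality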
    structured sparsity
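
The effect of the row-wise norm can be seen on a small example. This is a minimal sketch, assuming W is stored as a list of rows (one row per variable, one column per task); the function name `l21_norm` and the two toy matrices are illustrative only.

```python
import math


def l21_norm(W):
    """Sum over the rows of W of the Euclidean norm of each row."""
    return sum(math.sqrt(sum(v * v for v in row)) for row in W)


# d = 3 variables (rows), T = 2 tasks (columns).
# Both matrices have the same entries 3 and 4, hence the same Frobenius norm.
W_shared = [[3.0, 4.0], [0.0, 0.0], [0.0, 0.0]]  # both tasks use variable 0
W_spread = [[3.0, 0.0], [0.0, 4.0], [0.0, 0.0]]  # tasks use different variables

shared_penalty = l21_norm(W_shared)  # sqrt(9 + 16) = 5
spread_penalty = l21_norm(W_spread)  # 3 + 4 = 7
```

The matrix whose tasks share one variable is penalised less than the one whose tasks use different variables, which is how this regulariser favours a small common set of non-zero rows.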

Clustered MTL

Further topics

Transferring to new tasks

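  • Having found a feature map h, to test it on the environment $\mathcal{E}$ we
    1) draw a task $\mu \sim \mathcal{E}$
    2) draw a sample $z \sim \mu^n$
    3) run the algorithm to obtain $a(h)_z = \hat{f}_{h,z} \circ h$
    4) measure the loss of $a(h)_z$ on a random pair $(x,y) \sim \mu$
  • The error associated with the algorithm a(h) is
    $$R_n(h) = \mathbb{E}_{\mu\sim\mathcal{E}}\, \mathbb{E}_{z\sim\mu^n}\, \mathbb{E}_{(x,y)\sim\mu}\, \ell\big(y, a(h)_z(x)\big)$$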
  • The best value for a representation h given complete knowledge of the environment is then
  • Compare to the very best we can do:


  • The excess error associated with h is then $R_n(h) - R^*$, where $R^*$ is the very best error defined above
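
The four-step test protocol above can be approximated by Monte Carlo. The sketch below is illustrative only and rests on assumptions that are not in the notes: the environment draws a scalar w from a Gaussian and defines the task y = w·x[0] on two-dimensional Gaussian inputs, the representation h maps x to a single scalar feature, the learning algorithm is one-dimensional ridge regression, and the loss is the square loss. All names are hypothetical.

```python
import random


def ridge_1d(features, labels, lam=0.1):
    """Ridge regression on a scalar feature: c = sum(h*y) / (sum(h*h) + lam)."""
    num = sum(h * y for h, y in zip(features, labels))
    den = sum(h * h for h in features) + lam
    return num / den


def transfer_error(h, n=10, num_tasks=500, seed=0):
    """Monte Carlo estimate of the error of the algorithm a(h)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_tasks):
        # 1) draw a task from the environment (assumed: y = w * x[0])
        w = rng.gauss(1.0, 0.5)

        def draw_pair():
            x = (rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0))
            return x, w * x[0]

        # 2) draw a sample of n pairs from the task
        sample = [draw_pair() for _ in range(n)]
        # 3) run the algorithm on the represented sample
        c = ridge_1d([h(x) for x, _ in sample], [y for _, y in sample])
        # 4) measure the square loss on a fresh pair from the same task
        x, y = draw_pair()
        total += (y - c * h(x)) ** 2
    return total / num_tasks


# A representation that keeps the relevant coordinate versus one that drops it.
good = transfer_error(lambda x: x[0])
bad = transfer_error(lambda x: x[1])
```

In this toy environment the representation that keeps the relevant coordinate yields a much smaller estimated error than the one that discards it, which is the sense in which a good h makes new tasks easier to learn from few examples.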

Case of the variance regulariser

  • Training
  • Testing
  • Error
  • Best we can do
  • Excess error of $w_0$: $R_n(w_0) - R^*$

Informal reasoning

The feature map B learned from the training tasks can be used to learn a new task more quickly (a kind of bias learning heuristic).

  • Learn a new task by the method
  • Give more weight to important features. In particular, if some eigenvalues of $G = B^\top B$ are zero, the corresponding eigenvectors are discarded when learning a new task.
  • In the case of diagonal matrices, some elements may be zero which results in a decreased number of parameters to learn.
  • A statistical justification of an approach similar to this based on dictionary learning can be given.

Take home message

  • MTL objective function
  • regulariser
  • link to kernel trick

  1. Multi-task learning, Wikipedia