Multi-task Learning


Based on Supervised Learning, Lecture 8

  • Multi-task learning
    • Mathematical formulation
  • Linear MTL
    • Regularisers for linear MTL
      • Quadratic regulariser
      • Structured sparsity
  • Clustered MTL
  • Further topics
    • Transferring to new tasks
      • Case of the variance regulariser
      • Informal reasoning
  • Take home message

Multi-task learning

  • Multi-task learning (MTL) is an approach to machine learning that learns a problem together with other related problems at the same time, using a shared representation.[1]
  • The goal of MTL is to improve the performance of learning algorithms by learning classifiers for multiple tasks jointly.
  • Typical scenario: many tasks but only a few examples per task. If $n < d$ we do not have enough data to learn the tasks one by one. However, if the tasks are related and the constraint set $S$ (introduced below) or the associated regulariser captures such relationships in a simple way, learning the tasks jointly greatly improves over independent task learning (ITL).
  • When problems (tasks) are closely related, learning in parallel can be more efficient than learning tasks independently. Also, this often leads to a better model for the main task, because it allows the learner to use the commonality among the tasks.
  • Applications: learning a set of linear classifiers for related objects (cars, lorries, bicycles), user modelling, multiple object detection in scenes, affective computing, bioinformatics, health informatics, marketing science, neuroimaging, NLP, speech…
  • Further categorisation is possible, e.g. hierarchical models, clustering of tasks.
  • The ideas can be extended to non-linear cases through RKHS.

Mathematical formulation

  • Fix probability measures $\mu_1,\dots,\mu_T$ on $\mathbb{R}^d\times\mathbb{R}$
    – $T$ tasks
    – Each task is a probability measure, e.g. $\mu_t(x,y) = P(x)\,\delta(\langle w_t, x\rangle - y)$, where the Dirac delta $\delta$ encodes a deterministic conditional probability ($y = \langle w_t, x\rangle$) and $w_t$ is an underlying parameter
    – $\mathbb{R}^d$ can also be replaced by a Hilbert space
  • Draw data: $(x_{t1}, y_{t1}),\dots,(x_{tn}, y_{tn}) \sim \mu_t$, $t = 1,\dots,T$, with $x_{ti}\in\mathbb{R}^d$ a vector and $y_{ti}\in\mathbb{R}$ a scalar (in practice $n$ may vary with $t$)
  • Learning method:

    $\min_{(f_1,\dots,f_T)\in\mathcal{F}}\ \frac{1}{T}\sum_{t=1}^T \frac{1}{n}\sum_{i=1}^n \ell(y_{ti}, f_t(x_{ti}))$


    where $\mathcal{F}$ is a set of vector-valued functions. A standard choice is a ball in an RKHS which models interactions between the tasks, in the sense that functions with small norm have strongly related components.
  • Goal is to minimise the multi-task error

    $\mathcal{R}(f_1,\dots,f_T) = \frac{1}{T}\sum_{t=1}^T \mathbb{E}_{(x,y)\sim\mu_t}\,\ell(y, f_t(x))$
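
As a concrete illustration of the objective above, here is a minimal numpy sketch (our own, not from the lecture) that evaluates the empirical multi-task objective for the squared loss; the function name `multitask_empirical_risk` and the toy data are illustrative assumptions.

```python
# Minimal sketch of the empirical multi-task objective
#   (1/T) * sum_t (1/n) * sum_i loss(y_ti, f_t(x_ti))
# using the squared loss.  All names and the toy data are illustrative.
import numpy as np

def multitask_empirical_risk(X, Y, predictors, loss=lambda y, p: (y - p) ** 2):
    """X, Y: lists of length T with X[t] of shape (n, d) and Y[t] of shape (n,).
    predictors: list of T callables f_t mapping an (n, d) array to (n,) predictions."""
    T = len(X)
    per_task = [np.mean(loss(Y[t], predictors[t](X[t]))) for t in range(T)]
    return np.mean(per_task)

# Toy usage: T = 3 linear tasks, n = 5 examples each, d = 4 dimensions.
rng = np.random.default_rng(0)
T, n, d = 3, 5, 4
W = rng.normal(size=(T, d))                               # one weight vector per task
X = [rng.normal(size=(n, d)) for _ in range(T)]
Y = [X[t] @ W[t] + 0.1 * rng.normal(size=n) for t in range(T)]
predictors = [lambda x, w=W[t]: x @ w for t in range(T)]
print(multitask_empirical_risk(X, Y, predictors))
```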

Linear MTL

  • “task” = “linear model”
    – Regression: $y_{ti} = \langle w_t, x_{ti}\rangle + \epsilon_{ti}$
    – Binary classification: $y_{ti} = \mathrm{sign}(\langle w_t, x_{ti}\rangle)\,\epsilon_{ti}$
  • Learning method: $\min_{(w_1,\dots,w_T)\in S}\ \frac{1}{T}\sum_{t=1}^T \frac{1}{n}\sum_{i=1}^n \ell(y_{ti}, \langle w_t, x_{ti}\rangle)$. Here $S$ incorporates the prior knowledge about the regression vectors and encourages "common structure" among the tasks, e.g. the ball of a matrix norm or of some other regulariser.
  • The multitask error of $W = [w_1,\dots,w_T]$ is: $\mathcal{R}(W) = \frac{1}{T}\sum_{t=1}^T \mathbb{E}_{(x,y)\sim\mu_t}\,\ell(y, \langle w_t, x\rangle)$
  • It is possible to give bounds on the uniform deviation
    $\sup_{W\in S}\Big\{\mathcal{R}(W) - \frac{1}{T}\sum_{t=1}^T \frac{1}{n}\sum_{i=1}^n \ell(y_{ti}, \langle w_t, x_{ti}\rangle)\Big\}$

    and derive bounds for the excess error
    $\mathcal{R}(\hat W) - \min_{W\in S} \mathcal{R}(W)$
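
The uniform deviation above compares $\mathcal{R}(W)$ with the empirical multi-task error of the same $W$. The following sketch (our own toy setup with squared loss and Gaussian linear regression tasks) estimates both quantities for one fixed $W$ by Monte Carlo; it does not compute the supremum over $S$, only the gap for a single candidate.

```python
# Toy illustration of the empirical vs. expected multi-task error of one fixed W
# (squared loss, Gaussian linear regression tasks).  The uniform-deviation bounds
# control the supremum of this gap over all W in S; here we only look at a single W.
import numpy as np

rng = np.random.default_rng(1)
T, n, d, sigma = 10, 20, 50, 0.1
W_true = rng.normal(size=(T, d))                # underlying task vectors w_t
W = W_true + 0.05 * rng.normal(size=(T, d))     # a fixed candidate W

def multitask_error(W, m):
    """Average squared-loss error over tasks, estimated from m fresh samples per task."""
    errs = []
    for t in range(T):
        X = rng.normal(size=(m, d))
        y = X @ W_true[t] + sigma * rng.normal(size=m)
        errs.append(np.mean((y - X @ W[t]) ** 2))
    return np.mean(errs)

empirical = multitask_error(W, n)           # empirical error on n samples per task
population = multitask_error(W, 100_000)    # Monte Carlo proxy for R(W)
print(empirical, population, population - empirical)
```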

Regularisers for linear MTL

Often we drop the constraint (i.e. $W\in S$) and consider the penalty method

$\min_{w_1,\dots,w_T}\ \frac{1}{T}\sum_{t=1}^T \frac{1}{n}\sum_{i=1}^n \ell(y_{ti}, \langle w_t, x_{ti}\rangle) + \lambda\,\Omega(w_1,\dots,w_T)$

Different regularisers encourage different types of commonalities between the tasks:

  • variance (or other convex quadratic regularisers) encourage closeness to the mean
    $\Omega_{\mathrm{var}} = \frac{1}{T}\sum_{t=1}^T \|w_t\|^2 + \frac{1-\gamma}{\gamma}\,\mathrm{Var}(w_1,\dots,w_T)$
  • Joint sparsity (or other structured sparsity regularisers) encourage few shared variables
    $\|W\|_{2,1} := \sum_{j=1}^d \sqrt{\sum_{t=1}^T w_{tj}^2}$
  • Trace norm (or other spectral regularisers which promote low-rank solutions) encourage few shared features
    $\|[w_1,\dots,w_T]\|_{\mathrm{tr}}$

    – extension of joint sparsity: it allows a rotation of the initial data representation
    – the trace norm is the $\ell_1$ norm of the singular values of the matrix, so bounding it favours low-rank solutions (i.e. a common low-dimensional subspace)
  • More sophisticated regularisers which combine the above, promote clustering of tasks, etc.
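
A short numpy sketch (our own illustration) evaluating the three regularisers on weight matrices with the corresponding structure: columns close to their mean, few non-zero rows, and low rank. All names and the toy matrices are ours.

```python
# Toy comparison of the three regularisers on W = [w_1, ..., w_T], stored as a d x T
# matrix with one column per task.  The matrices below have exactly the structure each
# regulariser favours: columns close to their mean, few non-zero rows, low rank.
import numpy as np

def variance_reg(W, gamma):
    """(1/T) sum_t ||w_t||^2 + ((1 - gamma)/gamma) * Var(w_1, ..., w_T)."""
    T = W.shape[1]
    w_bar = W.mean(axis=1, keepdims=True)
    var = np.sum((W - w_bar) ** 2) / T
    return np.sum(W ** 2) / T + (1 - gamma) / gamma * var

def l21_norm(W):
    """Sum over variables (rows) of the l2 norm across tasks."""
    return np.sum(np.sqrt(np.sum(W ** 2, axis=1)))

def trace_norm(W):
    """l1 norm of the singular values of W."""
    return np.sum(np.linalg.svd(W, compute_uv=False))

rng = np.random.default_rng(2)
d, T = 30, 10
W_close = np.tile(rng.normal(size=(d, 1)), (1, T)) + 0.01 * rng.normal(size=(d, T))
W_rowsparse = np.zeros((d, T)); W_rowsparse[:3] = rng.normal(size=(3, T))
W_lowrank = rng.normal(size=(d, 2)) @ rng.normal(size=(2, T))
for name, W in [("close to mean", W_close), ("row sparse", W_rowsparse), ("low rank", W_lowrank)]:
    print(f"{name:>13}: var={variance_reg(W, 0.5):7.2f}  l21={l21_norm(W):7.2f}  tr={trace_norm(W):7.2f}")
```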

Quadratic regulariser

  • general quadratic regulariser
    $\Omega(w_1,\dots,w_T) = \sum_{s,t=1}^T \langle w_s, E_{st} w_t\rangle$

    where the matrix $E = (E_{st})_{s,t=1}^T \in \mathbb{R}^{dT\times dT}$ (with blocks $E_{st}\in\mathbb{R}^{d\times d}$) is positive definite.
  • variance regulariser
    Let $\gamma\in[0,1]$ (endpoints understood as limits) and
    $\Omega_{\mathrm{var}} = \frac{1}{T}\sum_{t=1}^T \|w_t\|^2 + \frac{1-\gamma}{\gamma}\,\mathrm{Var}(w_1,\dots,w_T) = \frac{1}{T}\sum_{t=1}^T \|w_t\|^2 + \frac{1-\gamma}{\gamma T}\sum_{t=1}^T \|w_t - \bar w\|_2^2$, where $\bar w = \frac{1}{T}\sum_{t=1}^T w_t$

    – $\gamma = 1$: independent tasks; $\gamma = 0$: identical tasks
    – the regulariser favours weight vectors which are close to their mean
    – with an SVM and the hinge loss, the objective function is a compromise between maximising the individual margins and minimising the variance (i.e. keeping the task weight vectors close to each other)
  • Link to the kernel methods (quadratic regulariser)
    The problem
    $\min_{w_1,\dots,w_T}\ \frac{1}{T}\sum_{t=1}^T \frac{1}{n}\sum_{i=1}^n \ell(y_{ti}, \langle w_t, x_{ti}\rangle) + \lambda \sum_{s,t=1}^T \langle w_s, E_{st} w_t\rangle$

    is equivalent to

    $\min_{v}\ \frac{1}{T}\sum_{t=1}^T \frac{1}{n}\sum_{i=1}^n \ell(y_{ti}, \langle v, B_t x_{ti}\rangle) + \lambda \langle v, v\rangle \qquad (1)$


    where the $B_t$ are $p\times d$ matrices (typically $p \geq d$) linked to $E$ by $E = (B^\top B)^{-1}$, with $B = [B_1,\dots,B_T] \in \mathbb{R}^{p\times dT}$ the column-wise concatenation, and $w_t = B_t^\top v$.
    Interpretation:
    – We learn a single function $(x,t) \mapsto f_t(x)$ using the feature map $(x,t) \mapsto B_t x$ and the corresponding multitask kernel $K((x_1,t_1),(x_2,t_2)) = \langle B_{t_1} x_1, B_{t_2} x_2\rangle$
    – Writing $\langle v, B_t x\rangle = \langle B_t^\top v, x\rangle$, we interpret this as having a single regression vector $v$ which is transformed by the matrix $B_t$ to obtain the task-specific weight vector.
  • Link to the kernel methods (variance regulariser)
    The problem
    $\min_{w_1,\dots,w_T}\ \frac{1}{Tn}\sum_{t,i} \ell(y_{ti}, \langle w_t, x_{ti}\rangle) + \lambda\Big(\frac{1}{T}\sum_{t=1}^T \|w_t\|^2 + \frac{1-\gamma}{\gamma}\,\mathrm{Var}(w_1,\dots,w_T)\Big)$

    is equivalent to
    $\min_{w_0, u_1,\dots,u_T}\ \frac{1}{Tn}\sum_{t,i} \ell(y_{ti}, \langle w_0 + u_t, x_{ti}\rangle) + \lambda\Big(\frac{1}{\gamma T}\sum_{t=1}^T \|u_t\|^2 + \frac{1}{1-\gamma}\|w_0\|^2\Big) \qquad (2)$

    by setting $w_t = w_0 + u_t$ and minimising over $w_0$.
    It is of the form (1) with
    $v = \big((1-\gamma)^{-\frac12} w_0,\ (\gamma T)^{-\frac12} u_1,\ \dots,\ (\gamma T)^{-\frac12} u_T\big) \in \mathbb{R}^{(T+1)d}$,
    $B_t^\top = \big[\,\sqrt{1-\gamma}\, I_{d\times d},\ \underbrace{0_{d\times d},\dots,0_{d\times d}}_{t-1},\ \sqrt{\gamma T}\, I_{d\times d},\ \underbrace{0_{d\times d},\dots,0_{d\times d}}_{T-t}\,\big]$, so that $B_t^\top v = w_0 + u_t = w_t$

    and the corresponding kernel $K((x_1,t_1),(x_2,t_2)) = (1-\gamma + \gamma T\,\delta_{t_1 t_2})\,\langle x_1, x_2\rangle$
    By writing (2) as the following, it is more apparent that we regularise around some common vector w0
    $\min_{w_0}\ \frac{1}{T}\sum_{t=1}^T \min_{w}\Big\{\frac{1}{n}\sum_{i=1}^n \ell(y_{ti}, \langle w, x_{ti}\rangle) + \frac{\lambda}{\gamma}\|w - w_0\|^2\Big\} + \frac{\lambda}{1-\gamma}\|w_0\|^2$
  • More multitask kernels
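
The following sketch (our own numerical check, not from the lecture) builds the block matrices $B_t$ described above and verifies that $\langle B_{t_1}x_1, B_{t_2}x_2\rangle$ equals the closed-form multitask kernel $(1-\gamma+\gamma T\,\delta_{t_1 t_2})\langle x_1, x_2\rangle$.

```python
# Numerical check that the block feature maps B_t above reproduce the multitask kernel
#   K((x1, t1), (x2, t2)) = (1 - gamma + gamma * T * delta_{t1 t2}) * <x1, x2>.
import numpy as np

rng = np.random.default_rng(3)
d, T, gamma = 4, 5, 0.3
I = np.eye(d)

def B(t):
    """(T+1)d x d matrix: sqrt(1-gamma) I on top, sqrt(gamma T) I in block t+1, zeros elsewhere."""
    blocks = [np.sqrt(1 - gamma) * I] + [np.zeros((d, d))] * T
    blocks[t + 1] = np.sqrt(gamma * T) * I
    return np.vstack(blocks)

x1, x2 = rng.normal(size=d), rng.normal(size=d)
for t1 in range(T):
    for t2 in range(T):
        lhs = (B(t1) @ x1) @ (B(t2) @ x2)
        rhs = (1 - gamma + gamma * T * (t1 == t2)) * (x1 @ x2)
        assert np.isclose(lhs, rhs)
print("kernel identity verified")
```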

Structured sparsity

  • general sparsity regulariser
    $\|W\|_{2,1} := \sum_{j=1}^d \sqrt{\sum_{t=1}^T w_{tj}^2}$

    – the sum over the rows of $W$ (one row per variable $j$) of their $\ell_2$ norms across tasks
    – encourages a matrix with only a few non-zero rows
    – the regression vectors are sparse and share a sparsity pattern of small cardinality (structured sparsity)
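
The notes do not discuss how to optimise with this penalty; purely as an illustration of why it zeroes out entire rows, here is the proximal operator of $\tau\|\cdot\|_{2,1}$ (row-wise group soft-thresholding), a standard building block for such problems. The function name and the toy matrix are our own assumptions.

```python
# Illustration only: the proximal operator of tau * ||.||_{2,1} shrinks the l2 norm of
# every row of W and sets rows with small norm exactly to zero, which is how the penalty
# produces matrices with only a few non-zero rows (shared variables).
import numpy as np

def prox_l21(W, tau):
    """Row-wise group soft-thresholding of a d x T matrix W."""
    row_norms = np.linalg.norm(W, axis=1, keepdims=True)
    scale = np.maximum(1.0 - tau / np.maximum(row_norms, 1e-12), 0.0)
    return scale * W

rng = np.random.default_rng(4)
W = rng.normal(size=(6, 3)) * np.array([[2.0], [2.0], [0.1], [0.1], [0.1], [0.1]])
print(prox_l21(W, tau=0.5))   # small rows are typically zeroed, large rows only shrunk
```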

Clustered MTL

Further topics

Transferring to new tasks

  • Having found a feature map $h$, to test it on the environment $\mathcal{E}$ we
    1) draw a task $\mu \sim \mathcal{E}$
    2) draw a sample $z \sim \mu^n$
    3) run the algorithm to obtain $a(h)_z = \hat f_{h,z}\circ h$
    4) measure the loss of $a(h)_z$ on a random pair $(x,y)\sim\mu$
  • The error associated with the algorithm a(h) is
    $\mathcal{R}_n(h) = \mathbb{E}_{\mu\sim\mathcal{E}}\,\mathbb{E}_{z\sim\mu^n}\,\mathbb{E}_{(x,y)\sim\mu}\big[\ell(a(h)_z(x), y)\big]$ (estimated by Monte Carlo in the sketch after this list)
  • The best value for a representation h given complete knowledge of the environment is then
    $\min_{h\in\mathcal{H}} \mathcal{R}_n(h)$
  • Compare to the very best we can do:

    $\mathcal{R}^* = \min_{h\in\mathcal{H}} \mathbb{E}_{\mu\sim\mathcal{E}}\Big[\min_{f\in\mathcal{F}} \mathbb{E}_{(x,y)\sim\mu}\,\ell(f(h(x)), y)\Big]$

  • The excess error associated with $h$ is then $\mathcal{R}_n(h) - \mathcal{R}^*$
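
A toy Monte Carlo sketch (our own, with a Gaussian linear environment and ridge regression as the algorithm $a(h)$) of the four-step protocol and of estimating $\mathcal{R}_n(h)$ for a fixed linear representation $h$. The environment, the constants and all function names are assumptions made for illustration.

```python
# Toy Monte Carlo sketch of the four-step protocol, with a Gaussian linear environment
# and ridge regression on h(x) = H x as the algorithm a(h).  Averaging the test losses
# over many draws estimates R_n(h).  All constants and names are illustrative.
import numpy as np

rng = np.random.default_rng(5)
d, n, lam, sigma = 20, 10, 0.1, 0.1
w_env = rng.normal(size=d)                       # tasks are scattered around w_env

def draw_task():
    return w_env + 0.1 * rng.normal(size=d)      # a task = its regression vector

def algorithm(H, X, y):
    """Ridge regression on the transformed inputs H x."""
    Z = X @ H.T
    return np.linalg.solve(Z.T @ Z / n + lam * np.eye(Z.shape[1]), Z.T @ y / n)

def estimate_Rn(H, trials=2000):
    losses = []
    for _ in range(trials):
        w = draw_task()                                                   # 1) draw mu ~ E
        X = rng.normal(size=(n, d))
        y = X @ w + sigma * rng.normal(size=n)                            # 2) draw z ~ mu^n
        v = algorithm(H, X, y)                                            # 3) run a(h) on z
        x_new = rng.normal(size=d)
        y_new = x_new @ w + sigma * rng.normal()                          # 4) fresh (x, y) ~ mu
        losses.append((y_new - v @ (H @ x_new)) ** 2)
    return np.mean(losses)

print(estimate_Rn(np.eye(d)))     # R_n(h) for h = identity representation
```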

Case of the variance regulariser

  • Training
    $\min_{w_0}\ \frac{1}{T}\sum_{t=1}^T \min_{w}\Big\{\frac{1}{n}\sum_{i=1}^n \ell(y_{ti}, \langle w, x_{ti}\rangle) + \frac{\lambda}{\gamma}\|w - w_0\|^2\Big\} + \frac{\lambda}{1-\gamma}\|w_0\|^2$
  • Testing
    $\min_{w}\ \frac{1}{n}\sum_{i=1}^n \ell(y_i, \langle w, x_i\rangle) + \frac{\lambda}{\gamma}\|w - w_0\|^2$ (see the closed-form sketch after this list)
  • Error
    $\mathcal{R}_n(w_0) = \mathbb{E}_{\mu\sim\mathcal{E}}\,\mathbb{E}_{z\sim\mu^n}\,\mathbb{E}_{(x,y)\sim\mu}\,\ell(y, \langle w_0 + w_z, x\rangle)$, where $w_0 + w_z$ denotes the solution of the testing problem on the sample $z$
  • Best we can do
    $\mathcal{R}^* = \min_{w_0} \mathbb{E}_{\mu\sim\mathcal{E}}\Big[\min_{w} \mathbb{E}_{(x,y)\sim\mu}\,\ell(y, \langle w_0 + w, x\rangle)\Big]$
  • Excess error of $w_0$: $\mathcal{R}_n(w_0) - \mathcal{R}^*$
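
For the squared loss the testing step has a closed form, which makes the transfer mechanism concrete: the new task is regularised towards the common vector $w_0$. A minimal sketch (our own; `transfer_ridge` and the toy task distribution are assumptions) comparing transfer around $w_0$ with plain ridge regression on the same few examples:

```python
# Transfer with the variance regulariser (squared loss): on a new task we solve
#   min_w (1/n) sum_i (y_i - <w, x_i>)^2 + (lam/gamma) ||w - w_0||^2,
# whose closed form is (X'X/n + (lam/gamma) I)^{-1} (X'y/n + (lam/gamma) w_0).
# The common vector w_0, the toy task and all names are illustrative.
import numpy as np

def transfer_ridge(X, y, w0, lam, gamma):
    n, d = X.shape
    A = X.T @ X / n + (lam / gamma) * np.eye(d)
    b = X.T @ y / n + (lam / gamma) * w0
    return np.linalg.solve(A, b)

rng = np.random.default_rng(6)
d, n, lam, gamma = 30, 10, 0.1, 0.5
w0 = rng.normal(size=d)                           # common vector found at training time
w_new = w0 + 0.1 * rng.normal(size=d)             # a new task close to w_0
X = rng.normal(size=(n, d))
y = X @ w_new + 0.1 * rng.normal(size=n)
w_hat = transfer_ridge(X, y, w0, lam, gamma)                   # regularise towards w_0
w_ind = transfer_ridge(X, y, np.zeros(d), lam, gamma)          # plain ridge (ignores w_0)
print(np.linalg.norm(w_hat - w_new), np.linalg.norm(w_ind - w_new))
```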

Informal reasoning

The feature map B learned from the training tasks can be used to learn a new task more quickly (a kind of bias learning heuristic).

  • Learn a new task by the method
    $\min_{v}\Big\{\frac{1}{n}\sum_{i=1}^n \ell(y_i, \langle v, B x_i\rangle) + \frac{\lambda}{2}\|v\|_2^2\Big\}$ (a closed-form sketch for the squared loss follows this list)
  • Give more weight to important features. In particular, if some eigenvalues of $G = B^\top B$ are zero, the corresponding eigenvectors are discarded when learning a new task.
  • In the case of diagonal matrices, some elements may be zero which results in a decreased number of parameters to learn.
  • A statistical justification of an approach similar to this based on dictionary learning can be given.
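
A minimal sketch (our own, squared loss) of the method above: ridge regression on the transformed inputs $Bx_i$, where input directions in the null space of $G = B^\top B$ are discarded automatically because $B$ maps them to zero. The map $B$ and the toy task are assumptions.

```python
# Learning a new task with a learned feature map B (squared loss):
#   min_v (1/n) sum_i (y_i - <v, B x_i>)^2 + (lam/2) ||v||^2.
# Directions of the input space in the null space of G = B^T B are discarded,
# since B maps them to zero.  B and the toy task are illustrative.
import numpy as np

def learn_new_task(B, X, y, lam):
    Z = X @ B.T                                   # rows are the features B x_i
    n, p = Z.shape
    return np.linalg.solve(Z.T @ Z / n + (lam / 2) * np.eye(p), Z.T @ y / n)

rng = np.random.default_rng(7)
d, p, n, lam = 20, 3, 15, 0.01
B = rng.normal(size=(p, d))                       # learned map keeping a 3-dimensional subspace
w_new = B.T @ rng.normal(size=p)                  # a new task living in that subspace
X = rng.normal(size=(n, d))
y = X @ w_new + 0.05 * rng.normal(size=n)
v = learn_new_task(B, X, y, lam)
print(np.linalg.norm(B.T @ v - w_new))            # the effective vector B^T v recovers w_new
```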

Take home message

  • MTL objective function
  • regulariser
  • link to kernel trick


  1. Multi-task learning, Wikipedia
    https://en.wikipedia.org/wiki/Multi-task_learning