Is functional analysis relevant to machine learning?


From Quora

One place where functional analysis is particularly relevant to machine learning is the study of kernel methods, a notable example of which is the kernel Support Vector Machine (SVM); here the theory of reproducing kernel Hilbert spaces (RKHS) plays a central role.

Every positive definite kernel K (on any type of data) uniquely defines a Hilbert space \mathcal{H}, called the RKHS with reproducing kernel K. This space satisfies a set of properties and, in particular, provides a 'feature map' \phi from the original space to the RKHS for which the kernel corresponds to an inner product: K(x,y) = \langle \phi(x), \phi(y) \rangle_\mathcal{H} (in fact, \phi(x) = K(x,\cdot)).
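As a concrete illustration of the feature-map view (my own sketch, not from the original answer): the homogeneous polynomial kernel K(x,y) = (x^\top y)^2 on \mathbb{R}^2 admits the explicit feature map \phi(x) = (x_1^2, x_2^2, \sqrt{2} x_1 x_2), and the kernel value matches the inner product of the mapped points.

# A minimal sketch (not from the original answer): for the homogeneous
# polynomial kernel K(x, y) = (x . y)^2 on R^2, the explicit feature map
# phi(x) = (x1^2, x2^2, sqrt(2)*x1*x2) satisfies K(x, y) = <phi(x), phi(y)>.
import numpy as np

def poly_kernel(x, y):
    return np.dot(x, y) ** 2

def phi(x):
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])
print(poly_kernel(x, y))       # 1.0, since (1*3 + 2*(-1))^2 = 1
print(phi(x) @ phi(y))         # 1.0, the same value

For kernels like the Gaussian, no such finite-dimensional \phi exists, which is where the RKHS construction \phi(x) = K(x,\cdot) becomes essential.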

One can study the properties of such Hilbert spaces, which can be infinite-dimensional even when the input space on which the kernel is defined isn't. For example, the RKHS of a Gaussian kernel is infinite-dimensional; the RKHS of the min kernel K(x,y) = \min(x,y) is a Hilbert space similar to a Sobolev space (the inner product between two functions in the space is the integral of the product of their derivatives); and the polynomial kernel gives a space of polynomials.
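A quick numerical sanity check of positive definiteness (an illustration I am adding, with an arbitrary bandwidth and sample range): any Gram matrix built from these kernels should have nonnegative eigenvalues, up to floating-point error. The min kernel is evaluated on nonnegative inputs, where it is positive definite.

# Sanity check (my own illustration): Gram matrices of the positive definite
# kernels above should have nonnegative eigenvalues (up to floating point).
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    return np.exp(-(x - y) ** 2 / (2 * sigma ** 2))

def min_kernel(x, y):
    return np.minimum(x, y)

rng = np.random.default_rng(0)
pts = rng.uniform(0.0, 5.0, size=20)    # nonnegative inputs for the min kernel

for kernel in (gaussian_kernel, min_kernel):
    gram = np.array([[kernel(a, b) for b in pts] for a in pts])
    print(kernel.__name__, np.linalg.eigvalsh(gram).min())  # >= 0 up to rounding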

One interesting property, which can be easily shown using the 'reproducing property' of the RKHS and the Cauchy-Schwarz inequality, is that for a function f \in \mathcal{H},
|f(x) - f(y)| \leq \|f\|_\mathcal{H} \|\phi(x) - \phi(y)\|_\mathcal{H},
i.e. f is Lipschitz with constant \|f\|_\mathcal{H} with respect to the geometry that the map \phi defines on the input space: the variations of the function are controlled by the variations of the mapped inputs. In other words, the RKHS norm directly measures the smoothness of the function (the smaller the norm, the smaller the variations).
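The bound can be checked numerically for a function in the span of kernel evaluations (again my own illustration), using \|f\|_\mathcal{H}^2 = \alpha^\top K \alpha and \|\phi(x)-\phi(y)\|_\mathcal{H}^2 = K(x,x) - 2K(x,y) + K(y,y):

# Numerical check of the bound (illustration only): take f = sum_i alpha_i K(x_i, .)
# in the RKHS of a Gaussian kernel; then ||f||_H^2 = alpha^T K alpha and
# ||phi(x) - phi(y)||_H^2 = K(x, x) - 2 K(x, y) + K(y, y).
import numpy as np

def k(x, y, sigma=1.0):
    return np.exp(-(x - y) ** 2 / (2 * sigma ** 2))

rng = np.random.default_rng(1)
xs = rng.normal(size=10)                 # centers x_i
alpha = rng.normal(size=10)              # coefficients alpha_i
K = np.array([[k(a, b) for b in xs] for a in xs])
f_norm = np.sqrt(alpha @ K @ alpha)      # ||f||_H

def f(x):
    return sum(a * k(xi, x) for a, xi in zip(alpha, xs))

for _ in range(5):
    x, y = rng.normal(size=2)
    lhs = abs(f(x) - f(y))
    rhs = f_norm * np.sqrt(k(x, x) - 2 * k(x, y) + k(y, y))
    print(lhs <= rhs)                    # True every time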

These RKHS define function spaces, and it turns out that one can optimize certain problems over these spaces (e.g. find the function in the space which gives the smallest error in an empirical risk minimization problem), using a key result called the representer theorem. The theorem states that if the objective depends on the function only through its evaluations at a set of n points x_i, and is non-decreasing in the RKHS norm of the function, then there is an optimal function in the linear span of the mapped points K(x_i,\cdot). This reduces the problem to an optimization over \mathbb{R}^n, which is much easier, as shown below. The condition on the RKHS norm is easy to satisfy if you add this norm as a 'regularizer' to your objective, which has the added benefit of controlling the smoothness (the 'complexity') of your function.
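To make the reduction concrete (this is the standard formulation, not spelled out in the original answer): for a regularized empirical risk problem
\min_{f \in \mathcal{H}} \sum_{i=1}^n L(f(x_i), y_i) + \lambda \|f\|_\mathcal{H}^2,
the representer theorem guarantees a minimizer of the form f = \sum_{i=1}^n \alpha_i K(x_i, \cdot). Plugging this form back in, f(x_j) = \sum_i \alpha_i K(x_i, x_j) and \|f\|_\mathcal{H}^2 = \alpha^\top K \alpha with Gram matrix K_{ij} = K(x_i, x_j), so the objective depends only on \alpha \in \mathbb{R}^n.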

As an example, say you want to learn a regression function f in some RKHS \mathcal{H} from a set of training points x_i. By the representer theorem you can take the function to be of the form f(x) = \sum_i \alpha_i K(x_i, x), and you're left with an optimization problem over the vector \alpha, whose dimensionality equals the size of your training set, even though \mathcal{H} might be infinite-dimensional! This is what happens in kernel SVMs, and it is part of the reason why they were originally so successful.
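Concretely, with the squared loss and the RKHS norm as regularizer this becomes kernel ridge regression, for which \alpha has a closed-form expression. A minimal sketch, assuming a Gaussian kernel and arbitrary hyperparameters of my own choosing:

# Minimal kernel ridge regression sketch (my illustration, not from the answer):
# squared loss plus lambda * ||f||_H^2 gives the closed-form coefficients
# alpha = (K + lambda * I)^{-1} y, and predictions f(x) = sum_i alpha_i K(x_i, x).
import numpy as np

def gaussian_kernel(a, b, sigma=0.5):
    # pairwise Gaussian kernel matrix between 1-D arrays a and b
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * sigma ** 2))

rng = np.random.default_rng(2)
x_train = np.linspace(0.0, 2 * np.pi, 30)
y_train = np.sin(x_train) + 0.1 * rng.normal(size=30)

lam = 1e-2
K = gaussian_kernel(x_train, x_train)
alpha = np.linalg.solve(K + lam * np.eye(len(x_train)), y_train)

def predict(x_new):
    return gaussian_kernel(x_new, x_train) @ alpha   # f(x) = sum_i alpha_i K(x_i, x)

print(predict(np.array([1.0, 3.0])))   # roughly sin(1.0) and sin(3.0)

Note that only an n x n linear system is ever solved; the (possibly infinite-dimensional) space \mathcal{H} never appears explicitly.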