This blog post is about one of the most theoretically sound families of Machine Learning techniques, Kernel Methods, which became popular in the 1990s along with their best-known member, the Support Vector Machine.

In kernel theory we assume that learning happens in an RKHS (Reproducing Kernel Hilbert Space, a well-behaved space of functions used in non-parametric statistics and machine learning), and the theorem that forms the backbone of learning in an RKHS is the Representer Theorem.

Before scaring you with the RKHS and theorems right from the beginning, let me explain the two main properties of Kernel Methods.

Property 1: Kernel Methods can be thought of as instance-based learners: rather than learning some fixed set of parameters corresponding to the features of their inputs, they instead “remember” the \(i\)-th training example \((\mathbf{x_i},y_i)\) and learn for it a corresponding weight \(w_{i}\). This means the functions learnt through Kernel Methods (hyperplanes in SVM, basis functions in Kernel PCA) can be represented as a weighted linear combination of kernel functions centered at the training points, and what the algorithm actually “learns” are the weights corresponding to each point. The Representer Theorem explains why. As per the theorem, the **regularized risk functional** (essentially the objective function of the optimization problem being solved) of any algorithm in the kernel-methods family takes the following general form:

\[\hspace{1cm} \frac{1}{N}\sum_{i=1}^{N}C(y_i,f(x_i)) + \frac{\lambda}{2}\Omega(\lVert f \rVert)\]

where
  • \(\Omega\) - any strictly monotonically increasing function of the RKHS norm \(\lVert f \rVert\)
  • \(C\) - the cost function
  • \(f\) - the function that we intend to learn
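To make Property 1 concrete, here is a minimal sketch in Python (the Gaussian kernel, the toy training points and the weights are purely illustrative assumptions, not taken from any particular algorithm): the prediction is literally a weighted sum of kernel evaluations between the query point and every remembered training point.

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    """Gaussian (RBF) kernel: k(a, b) = exp(-gamma * ||a - b||^2)."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

def predict(x, X_train, weights, gamma=1.0):
    """Instance-based prediction: a weighted sum of kernel evaluations
    between the query point x and every remembered training point."""
    return sum(w_i * rbf_kernel(x, x_i, gamma)
               for w_i, x_i in zip(weights, X_train))

# Toy setup: three "remembered" training points and their learnt weights w_i.
X_train = np.array([[0.0], [1.0], [2.0]])
weights = np.array([0.5, -1.2, 0.7])
print(predict(np.array([0.5]), X_train, weights))
```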

Property 2: Now comes the “Kernel Trick”. Through kernel functions, Kernel Methods let you learn as if the data were projected into a higher-dimensional space, while operating only in the original space. You can kernelize an algorithm by reformulating it so that all computations depend only on inner products between data points rather than on the data points themselves; those inner products can then be replaced by a kernel function (following the property of the RKHS).

\[\hspace{2cm} k(x,y) = \langle \phi(x),\phi(y) \rangle\]
  • “Why is it nice?”: Say you have a set of images of 32x32 = 1024 pixels each, and the data are not linearly separable in pixel space. One option is to project the data into a higher-dimensional space, say a quadratic feature space, \(\mathcal{R}^{1024} \rightarrow \mathcal{R}^{1024 \times 1024}\). Manipulating data in this space is very expensive. Kernel functions allow us to do the computations in the lower-dimensional space while effectively learning in the higher-dimensional one, as the sketch below illustrates.
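Here is a quick numerical sanity check of the trick, assuming a homogeneous degree-2 polynomial kernel and a tiny input dimension (8 instead of 1024, purely for illustration): the kernel computed in the original space equals the inner product taken after the explicit quadratic feature map.

```python
import numpy as np

def phi_quadratic(x):
    """Explicit quadratic feature map: all pairwise products x_i * x_j.
    A d-dimensional input is mapped to d*d dimensions."""
    return np.outer(x, x).ravel()

def poly2_kernel(x, y):
    """Homogeneous degree-2 polynomial kernel, computed in the original space."""
    return np.dot(x, y) ** 2

rng = np.random.default_rng(0)
x, y = rng.normal(size=8), rng.normal(size=8)

lhs = poly2_kernel(x, y)                           # O(d) work
rhs = np.dot(phi_quadratic(x), phi_quadratic(y))   # O(d^2) work
print(np.isclose(lhs, rhs))                        # True: same inner product
```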

A Few Algorithms

  1. Support Vector Machines

  2. Kernel Ridge Regression

  3. Kernel PCA
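If you just want to try these out, here is a rough usage sketch with scikit-learn's implementations of the three algorithms above (the RBF kernel, the hyperparameters and the toy data are illustrative assumptions):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.kernel_ridge import KernelRidge
from sklearn.decomposition import KernelPCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y_class = (X[:, 0] * X[:, 1] > 0).astype(int)              # non-linear labels
y_reg = np.sin(X[:, 0]) + 0.1 * rng.normal(size=100)       # non-linear targets

svm = SVC(kernel="rbf").fit(X, y_class)                    # Support Vector Machine
krr = KernelRidge(kernel="rbf", alpha=1.0).fit(X, y_reg)   # Kernel Ridge Regression
kpca = KernelPCA(n_components=2, kernel="rbf").fit(X)      # Kernel PCA

print(svm.predict(X[:3]), krr.predict(X[:3]), kpca.transform(X[:3]).shape)
```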

Appendix

Reproducing Kernel Hilbert Space (RKHS): a Hilbert space of functions associated with a kernel \(k: \mathcal{X}\times\mathcal{X} \rightarrow \mathcal{R}\), constructed in the following manner.

  • Let \(\mathcal{H'} = \\{ k(\cdot,x): x \in \mathcal{X} \\}\) be the set of kernel functions.
  • We construct a vector space \(\mathcal{H}\) from the linear combinations of the kernel functions in \(\mathcal{H'}\); any function \(f \in \mathcal{H}\) can be represented as a linear combination of a finite subset of these kernel functions, \(f = \sum_{i=1}^n \alpha_i k(\cdot,x_i)\) for some \(n\), \(x_i \in \mathcal{X}\) and \(\alpha_i \in \mathcal{R}\).

  • The inner product of \(f(\cdot) = \sum_{i=1}^n \alpha_i k(\cdot,x_i)\) and \(g(\cdot) = \sum_{j=1}^{n'} \beta_j k(\cdot,x'_j)\) in \(\mathcal{H}\) is defined as \(\langle f,g \rangle = \sum_{i=1}^n \sum_{j=1}^{n'} \alpha_i \beta_j k(x_i,x'_j)\). (It can be proven that this inner product is well defined.)
  • This particular definition gives rise to the reproducing property: \(\langle f,k(\cdot,x) \rangle = \sum_{i=1}^n \alpha_i k(x,x_i) = f(x)\). This shows that the kernel is a representer of evaluation (an evaluation functional), analogous to the Dirac delta function. (A small numerical sketch of this property follows this list.)

  • What is its significance?: It enables the Kernel Trick. While learning in an RKHS, the inner products in our computations can be replaced by kernel evaluations, \(\langle \phi(x),\phi(y) \rangle = \langle k(\cdot,x),k(\cdot,y) \rangle = k(x,y)\), where \(\phi\) maps \(x \in \mathcal{X}\) to a (possibly infinite-dimensional) space, \(\phi : x \rightarrow k(\cdot,x) \in \mathcal{H}\). This is particularly useful when the data are not linearly separable in \(\mathcal{X}\), where transformations to higher-dimensional spaces are necessary; the kernel trick spares us from making these transformations explicitly.
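Here is a small numerical illustration of the RKHS inner product and the reproducing property; the Gaussian kernel, the coefficients and the points are made-up assumptions for the sketch.

```python
import numpy as np

def k(a, b, gamma=0.5):
    """Gaussian kernel k(a, b) = exp(-gamma * ||a - b||^2)."""
    return np.exp(-gamma * np.sum((np.asarray(a) - np.asarray(b)) ** 2))

def inner(alphas, Xf, betas, Xg):
    """<f, g> = sum_i sum_j alpha_i * beta_j * k(x_i, x'_j),
    for f = sum_i alpha_i k(., x_i) and g = sum_j beta_j k(., x'_j)."""
    return sum(a * b * k(xi, xj)
               for a, xi in zip(alphas, Xf)
               for b, xj in zip(betas, Xg))

def f_eval(alphas, Xf, x):
    """Evaluate f(x) = sum_i alpha_i k(x, x_i) directly."""
    return sum(a * k(x, xi) for a, xi in zip(alphas, Xf))

alphas, Xf = [0.3, -1.0, 2.0], [[0.0], [1.0], [2.5]]
x = [0.7]

# Reproducing property: <f, k(., x)> equals the evaluation f(x).
lhs = inner(alphas, Xf, [1.0], [x])
rhs = f_eval(alphas, Xf, x)
print(np.isclose(lhs, rhs))   # True
```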

Representer Theorem: states that a minimizer \(f^{*}\) of a regularized empirical risk functional defined over a Reproducing Kernel Hilbert Space can be represented as a finite linear combination of kernel functions evaluated at the input points of the training set.

  • Precise Definition: Let \(\Omega : [0,\infty) \rightarrow \mathcal{R}\) be a strictly monotonically increasing function, \(\mathcal{X}\) a set, and \(C : (\mathcal{X} \times \mathcal{R}^2)^N \rightarrow \mathcal{R} \cup \\{\infty\\}\) an arbitrary loss function. Then any \(f \in\) RKHS \(\mathcal{F}\) minimizing the regularized risk functional

\(\frac{1}{N}\sum_{i=1}^{N}C(y_i,f(x_i)) + \frac{\lambda}{2}\Omega(\lVert f \rVert)\) admits a representation of the form \(f(\cdot) = \sum_{i=1}^{N}\alpha_i k(\cdot,x_i)\).

  • What is its significance?: Firstly, it gives a generic form of the optimization objectives that come under the umbrella of kernel methods. Secondly, representer theorems [Schölkopf, Herbrich and Smola, 2001] are useful from a practical standpoint because they dramatically simplify the regularized empirical risk minimization problem. In most interesting applications, the search domain \(\mathcal{F}\) for the minimization is an infinite-dimensional subspace of \(L^{2}(\mathcal{X})\), and therefore the search (as written) cannot be implemented on finite-memory, finite-precision computers. In contrast, the representation of \(f^{*}(\cdot)\) afforded by a representer theorem reduces the original (infinite-dimensional) minimization problem to a search for the optimal \(N\)-dimensional vector of coefficients \(\alpha = (\alpha_{1},\ldots,\alpha_{N}) \in \mathbb{R}^{N}\); \(\alpha\) can then be obtained by applying any standard function minimization algorithm. Consequently, representer theorems provide the theoretical basis for reducing the general machine learning problem to algorithms that can actually be implemented on computers in practice. (A concrete sketch for squared loss follows below.)
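As a concrete instance, take squared loss and \(\Omega(\lVert f \rVert) = \lVert f \rVert^2\), which gives kernel ridge regression. Plugging \(f = \sum_j \alpha_j k(\cdot,x_j)\) into the regularized risk and setting the gradient with respect to \(\alpha\) to zero yields the finite linear system \((K + \frac{N\lambda}{2} I)\alpha = y\), where \(K\) is the Gram matrix. A minimal sketch, assuming a Gaussian kernel and synthetic data:

```python
import numpy as np

def rbf_gram(X, Z, gamma=1.0):
    """Gram matrix K[i, j] = exp(-gamma * ||X[i] - Z[j]||^2)."""
    sq = np.sum(X**2, 1)[:, None] + np.sum(Z**2, 1)[None, :] - 2 * X @ Z.T
    return np.exp(-gamma * sq)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(50, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=50)

lam = 0.1
N = len(y)
K = rbf_gram(X, X)
# Minimizer of (1/N) sum_i (y_i - f(x_i))^2 + (lam/2) ||f||^2 over the RKHS:
# alpha solves (K + (N * lam / 2) * I) alpha = y.
alpha = np.linalg.solve(K + (N * lam / 2) * np.eye(N), y)

# Predictions are finite kernel expansions over the training points,
# exactly as the representer theorem promises.
X_test = np.linspace(-3, 3, 5).reshape(-1, 1)
f_test = rbf_gram(X_test, X) @ alpha
print(f_test)
```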


References

  1. Justin Domke - UMass Notes
  2. Sumitra S. Nair - IIST Notes
  3. NPTEL - P S Shastry - IISc
  4. Eric Xing - CMU Slides
  5. Wikipedia

Disclaimer

Please feel free to contact me at vaisakhs.shaj@gmail.com in case you find any errors or have suggestions; I will correct them promptly. You may also comment below.