Control As Inference In PGMs (Part 1) - Why is it Interesting?

2022-06-18T00:00:00+00:00

Motivation For This Blog

Probabilistic Graphical Models (PGMs) are powerful tools for representing relationships between random variables using graphs (directed or undirected). Once we come up with a representation of the stochastic phenomenon we wish to model, PGMs provide a consistent and flexible framework to devise principled objectives, set up models that reflect the causal structure in the world, and allow a common set of inference methods to be deployed against a broad range of problem domains.

This blog will try to summarize how Reinforcement Learning can be brought under the PGM framework which allows us to transfer policy search from an optimization point of view to an inference point of view. Also, we will see how this framework allows us to recover “soft” versions of Bellman backup equations from classical RL.

Premise :

Human or animal behaviour is often not perfectly optimal but approximately optimal. For example, an animal/human being whose only goal is to move from the start position to the goal position, as shown in figure 1, can choose either the perfectly optimal/hard optimal (green) trajectory or a suboptimal (blue) trajectory, but would typically avoid the red trajectories, which specify bad behaviour.

Figure 1: Stochastic and Suboptimal Human/Agent Behaviour

We need a probabilistic framework for this stochastic phenomenon of goal-directed behaviour. The framework, in addition to giving a higher probability to perfectly optimal behaviour, should also give a non-zero probability to suboptimal behaviour. Similarly, a near-zero probability should be assigned to bad behaviour that misses reaching the goal.

In machine learning whenever we have a stochastic phenomenon, we usually come up with a probabilistic graphical model (PGM) based on the observed stochastic phenomenon, such that the samples from this model will look like the observed stochastic phenomenon.

Thus we need a PGM that models optimal decision-making. Let’s first model a PGM that models the relationship between states, actions and next states as in figure 2.

Figure 2: PGM for physically consistent dynamics

This PGM in figure 2, commonly known as a state space model (SSM), can represent physically consistent trajectories, but for control/decision making we need a notion of cost/reward/optimality. Thus we modify the PGM by introducing additional optimality variables $O_t$, defined by $p(O_t=1 \mid s_t,a_t) = \exp(r(s_t,a_t))$. For this to be a valid probability the rewards must be non-positive, $r(s_t,a_t) \le 0$.

Figure 3: PGM for optimal decision making

Thus $O_t$ is defined as a binary random variable which indicates if the behaviour at time t was optimal or not. The reason why we define this as the exponential of the reward function is to make a convenient distinction between optimal (best way to reach a goal), sub-optimal (reaches the goal but not optimally) and bad behaviour (fails to reach the goal). This will become more apparent in the next section.

Motivation For Why Control As Inference Is An Interesting Paradigm

1. Convenient Way For Representing and Sampling Optimal and Suboptimal Behaviour

Let’s derive a mathematical expression for the probability of a trajectory (state action sequence) given the observed optimality variables. We will see that based on the PGM we can derive an expression where optimal/sub-optimal trajectories can be conveniently represented.

\[\begin{aligned}p\left(\tau \mid\mathcal{O}_{1: T}\right) & = \frac{p\left(\tau, \mathcal{O}_{1: T}\right)}{p\left( \mathcal{O}_{1: T}\right)} \\ & \propto p\left(\tau, \mathcal{O}_{1: T}\right)\\ & = p\left(s_{1}\right) \prod_{t=1}^{T} p\left(a_{t} \mid s_{t}\right) p\left(s_{t+1} \mid s_{t}, a_{t}\right) p\left(\mathcal{O}_{t} \mid s_{t}, a_{t}\right) \\ &=p\left(s_{1}\right) \prod_{t=1}^{T} p\left(s_{t+1} \mid s_{t}, a_{t}\right)p\left(a_{t} \mid s_{t}\right) \exp \left(r\left(s_{t}, a_{t}\right) \right) \\ &=\left[\underbrace{p\left(s_{1}\right) \prod_{t=1}^{T} p\left(s_{t+1} \mid s_{t}, a_{t}\right)p\left(a_{t} \mid s_{t}\right)}_\text{physically consistent dynamics with action prior}\right] \underbrace{\exp \left(\sum_{t=1}^{T} r\left(s_{t}, a_{t}\right) \right)}_\text{exponential of sum of rewards}\\ &=\left[\underbrace{p\left(s_{1}\right) \prod_{t=1}^{T} p\left(s_{t+1} \mid s_{t}, a_{t}\right)}_\text{physically consistent dynamics}\right] \underbrace{\exp \left(\sum_{t=1}^{T} \left[ r\left(s_{t}, a_{t}\right) + \log p\left(a_{t} \mid s_{t}\right) \right] \right)}_\text{exponential of sum of modified rewards} \end{aligned}\]

As we can see, trajectories that are physically consistent and optimal (in terms of long-term rewards) receive a higher probability mass. Additionally, a suboptimal trajectory with a slightly lower reward can also be modelled/sampled using this graphical model framework, which is important in several settings including inverse reinforcement learning.
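To make this concrete, here is a minimal sketch in Python. It assumes a hypothetical set of three candidate trajectories with made-up total rewards that are all equally likely under the dynamics and action prior; the names and numbers are illustrative only.

```python
import math

# Hypothetical total rewards (sum of r(s_t, a_t)) for three candidate trajectories.
# Rewards are non-positive so that exp(r) is a valid probability.
trajectory_returns = {"optimal": -1.0, "suboptimal": -2.0, "bad": -10.0}

# Assumption: all three are equally likely under the dynamics and action prior,
# so p(tau | O_{1:T}) is proportional to exp(sum of rewards).
unnormalized = {name: math.exp(ret) for name, ret in trajectory_returns.items()}
normalizer = sum(unnormalized.values())
posterior = {name: w / normalizer for name, w in unnormalized.items()}

for name, prob in posterior.items():
    print(f"{name}: {prob:.4f}")
```

The optimal trajectory gets the most mass, the suboptimal one keeps a clearly non-zero share, and the bad one is pushed close to zero, which is the qualitative behaviour asked for in the premise.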

Note: One could ignore the action prior or assume a uniform action prior to further simplify the equations.

2. Can use established inference schemes to answer several queries including:

Policy Search: Given a reward, infer the optimal policy by calculating $p(a_t \mid s_t, O_{t:T})$. Instead of solving the optimization problem, we now can solve the inference problem. This will be discussed in detail in part 2 of this blog. Further, an approximate inference scheme based on variational/optimization-based formulation is discussed in part 3 of this blog.
Inverse Reinforcement Learning: Given a collection of optimal trajectories, infer the reward and action priors, which is basically an inverse RL question.
\[\begin{aligned} p\left(\tau, \mathcal{O}_{1: T}, \theta, \phi\right) & \propto\left[p\left(s_{1}\right) \prod_{t=1}^{T} p\left(s_{t+1} \mid s_{t}, a_{t}\right)\right] \times \\ &\quad\quad\quad\quad\quad\quad\exp \left(\sum_{t=1}^{T} r_{\phi}\left(s_{t}, a_{t}\right)+\log p_{\theta}\left(a_{t} \mid s_{t}\right)\right) \\&=\left[p\left(s_{1}\right) \prod_{t=1}^{T} p\left(s_{t+1} \mid s_{t}, a_{t}\right)\right]\times \\ &\quad\quad\quad\quad\quad\quad\operatorname{exp}\left(\sum_{t=1}^{T} \phi^{T} f_{r}\left(s_{t}, a_{t}\right)+\log \theta^{T} f_{p}\left(a_{t} \mid s_{t}\right)\right) \end{aligned}\]

3. Allows us to model stochastic behaviour, which has several advantages

Transfer Learning: If we can model multiple ways to solve a particular task, this turns out to be relevant for transfer learning in a new setting where the task has to be solved in a slightly different manner.
Better Exploration Strategies: We will see that the maximum entropy objective that we derive in the 3rd part of this blog series, on Policy Search as Variational Inference, provides a natural exploration strategy based on entropy maximization.

The blog is based on the following reference, Levine, 2018.

Control As Inference In PGMs (Part 2) - Policy Search Via Exact Inference

2022-06-18T00:00:00+00:00

In the 1st part of the blog series on control as inference, we discussed why this paradigm is interesting. In the second part of the blog, we will discuss how we can treat policy search as exact inference in this graphical model via variable elimination.

We will see that the subroutines in the policy search procedure in this graphical model result in “soft” variations of the Bellman update equations, where the hard max operation is replaced by a softmax.

Log Sum Exp Trick

A useful “trick” to remember before we jump into the control as inference procedure is the “Log Sum Exp” trick. The LogSumExp (LSE) (also called RealSoftMax or multivariable softplus) function is defined as the logarithm of the sum of the exponentials of the arguments:

\[\operatorname{LSE}\left(x_{1}, \ldots, x_{n}\right)=\log \left(\exp \left(x_{1}\right)+\cdots+\exp \left(x_{n}\right)\right)\]

The LSE function is a smooth maximum, i.e. a smooth approximation to the maximum function, widely used in machine learning. In the following inference procedure, LSE terms appear where classical RL has a hard max, and interpreting them as a soft max yields a soft version of classical RL.
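As a quick illustration, the following is a small sketch of a numerically stable LSE in plain Python. The helper name `logsumexp` and the example values are chosen here for illustration and are not part of the original post.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a non-empty list of floats."""
    m = max(values)
    # Subtracting the max keeps every exponent <= 0, avoiding overflow.
    return m + math.log(sum(math.exp(v - m) for v in values))

q_values = [1.0, 2.0, 10.0]
print(max(q_values), logsumexp(q_values))  # LSE sits just above the hard max

big = [1000.0, 1001.0]
print(logsumexp(big))  # a naive exp(1000) would overflow
```

LSE always lies between the hard max and the hard max plus $\log n$, which is why it can stand in for the max in a Bellman-style backup.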

Exact Inference By Recursive Computation Of Backward Messages

At any time $t$, policy search involves computing the posterior over the action $a_t$, given the state $s_t$ and the optimality variables $O_{t:T}$, i.e. $p(a_t \mid s_t, O_{t:T})$.
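Applying Bayes' rule to this posterior shows where the backward messages come from. The following is a short sketch in the notation of Levine (2018), keeping the action prior $p(\mathbf{a}_t \mid \mathbf{s}_t)$ explicit:

\[\begin{aligned} p\left(\mathbf{a}_{t} \mid \mathbf{s}_{t}, O_{t:T}\right) &= \frac{p\left(\mathbf{s}_{t}, \mathbf{a}_{t} \mid O_{t:T}\right)}{p\left(\mathbf{s}_{t} \mid O_{t:T}\right)} = \frac{p\left(O_{t:T} \mid \mathbf{s}_{t}, \mathbf{a}_{t}\right) p\left(\mathbf{a}_{t} \mid \mathbf{s}_{t}\right) p\left(\mathbf{s}_{t}\right)}{p\left(O_{t:T} \mid \mathbf{s}_{t}\right) p\left(\mathbf{s}_{t}\right)} \\ &= \frac{\beta_{t}\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)}{\beta_{t}\left(\mathbf{s}_{t}\right)}\, p\left(\mathbf{a}_{t} \mid \mathbf{s}_{t}\right), \qquad \beta_{t}\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right) := p\left(O_{t:T} \mid \mathbf{s}_{t}, \mathbf{a}_{t}\right), \quad \beta_{t}\left(\mathbf{s}_{t}\right) := p\left(O_{t:T} \mid \mathbf{s}_{t}\right) \end{aligned}\]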

Thus policy search involves computing two backward messages $\color{orange} \beta_{t}\left(\mathbf{s}_{t}\right)$ and $\color{purple} \beta_{t}\left(\mathbf{s}_{t},\mathbf{a}_{t}\right)$. These are computed via a backward recursion, similar to the backward pass in HMMs or Kalman smoothers, as follows:

At the last time step T:
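Following the message definitions used in Levine (2018), the base case of the recursion can be written as follows (a sketch that keeps the action prior explicit):

\[\begin{aligned} \beta_{T}\left(\mathbf{s}_{T}, \mathbf{a}_{T}\right) &= p\left(O_{T} \mid \mathbf{s}_{T}, \mathbf{a}_{T}\right) = \exp \left(r\left(\mathbf{s}_{T}, \mathbf{a}_{T}\right)\right) \\ \beta_{T}\left(\mathbf{s}_{T}\right) &= p\left(O_{T} \mid \mathbf{s}_{T}\right) = \int_{\mathcal{A}} \beta_{T}\left(\mathbf{s}_{T}, \mathbf{a}_{T}\right) p\left(\mathbf{a}_{T} \mid \mathbf{s}_{T}\right) d \mathbf{a}_{T} \end{aligned}\]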

Note on Action Prior: Here $p(a_t \mid s_t)$ is the action prior. Note that it is not conditioned on $O_{1:T}$ in any way, i.e. it does not denote the probability of an optimal action, but simply the prior probability of actions. The PGM for RL as inference doesn’t actually contain this factor, and we can assume that $ p\left(a_t \mid s_t \right)= \frac{1}{ \mid \mathcal{A} \mid} $ for simplicity. That is, it is a constant corresponding to a uniform distribution over the set of actions.

Thought Exercise:

Levine 2018 assumes a uniform action prior and argues that this assumption does not introduce any loss of generality. One could show that any non-uniform action prior $p(a_t \mid s_t)$ can be incorporated into $p(O_t \mid s_t,a_t):=\exp(r_1(s_t,a_t))$ via a modified reward function $r_1(s_t,a_t)$.

At any time step t
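For a general time step, the messages satisfy the recursion below. This is a sketch following Levine (2018); it uses the fact that $O_t$ is independent of the later optimality variables given $(\mathbf{s}_t, \mathbf{a}_t)$:

\[\begin{aligned} \beta_{t}\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right) &= p\left(O_{t:T} \mid \mathbf{s}_{t}, \mathbf{a}_{t}\right) = p\left(O_{t} \mid \mathbf{s}_{t}, \mathbf{a}_{t}\right) \int_{\mathcal{S}} \beta_{t+1}\left(\mathbf{s}_{t+1}\right) p\left(\mathbf{s}_{t+1} \mid \mathbf{s}_{t}, \mathbf{a}_{t}\right) d \mathbf{s}_{t+1} \\ \beta_{t}\left(\mathbf{s}_{t}\right) &= p\left(O_{t:T} \mid \mathbf{s}_{t}\right) = \int_{\mathcal{A}} \beta_{t}\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right) p\left(\mathbf{a}_{t} \mid \mathbf{s}_{t}\right) d \mathbf{a}_{t} \end{aligned}\]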

Intuition / Relationship To Classical RL

In order to get an intuitive meaning of these messages, we do some algebraic manipulation in the log space to get a form similar to the Bellman backup. The messages in log space are as follows:
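Taking logarithms of the backward messages and using $p(O_t \mid \mathbf{s}_t,\mathbf{a}_t)=\exp(r(\mathbf{s}_t,\mathbf{a}_t))$ gives the following (a reconstruction in the notation of Levine (2018), with the action prior kept explicit):

\[\begin{aligned} \log \beta_{t}\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right) &= r\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)+\log E_{\mathbf{s}_{t+1} \sim p\left(\mathbf{s}_{t+1} \mid \mathbf{s}_{t}, \mathbf{a}_{t}\right)}\left[\beta_{t+1}\left(\mathbf{s}_{t+1}\right)\right] \\ \log \beta_{t}\left(\mathbf{s}_{t}\right) &= \log \int_{\mathcal{A}} \beta_{t}\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right) p\left(\mathbf{a}_{t} \mid \mathbf{s}_{t}\right) d \mathbf{a}_{t} \end{aligned}\]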

Equation 1: Messages in log space

One can show that if we replace $\log \beta(s_t,a_t)$ with the Q function and $\log \beta(s_t)$ with the value function, we recover an intuitive relationship resembling the Bellman backup operator in the deterministic case and an optimistic Bellman backup in the stochastic case. Let,

\[\begin{aligned} \color{purple} Q\left(\mathbf{s}_{t}, \mathbf{a}_{t} \right) &= \log \beta_{t}\left(\mathbf{s}_{t}, \mathbf{a}_{t} \right) \\ \color{orange}V\left(\mathbf{s}_{t}\right) &= \log \beta_{t}\left(\mathbf{s}_{t} \right) \end{aligned}\]

Equation 2: Relationship between backward messages and Q/Value functions in classical RL.

Using Equations 1 and 2, now the messages in the log space look as follows:
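Assuming a uniform action prior, whose constant only shifts the value by an additive term and is dropped here, the state message becomes a soft maximum over actions. This is a sketch consistent with Levine (2018):

\[V\left(\mathbf{s}_{t}\right)=\log \int_{\mathcal{A}} \exp \left(Q\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)\right) d \mathbf{a}_{t}\]

The state-action message depends on the transition dynamics and is treated case by case below.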

Deterministic Case (Regular Bellman Backup)

In the deterministic case since we only have one possibility for transition dynamics, we obtain a backup equation similar to the regular Bellman backup equation.

\[\color{purple}Q\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)=\color{black}r\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)+\color{orange}V\left(\mathbf{s}_{t+1}\right)\]

Stochastic Case (Optimistic Backup)

In the stochastic case we obtain a backup equation that is optimistic as shown below:

\[\color{purple}Q\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)=\color{black}r\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)+\log E_{\mathbf{s}_{t+1} \sim p\left(\mathbf{s}_{t+1} \mid \mathbf{s}_{t}, \mathbf{a}_{t}\right)}\left[\exp \left(\color{orange}V\left(\mathbf{s}_{t+1}\color{black}\right)\right)\right]\]

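The backup is optimistic because $\log E[\exp(V)]$ is a soft maximum over next states: it is dominated by the largest next-state value rather than the average, so a lucky but unlikely transition can make an action look good. This creates risk-seeking behaviour.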
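The gap between the two backups is easy to see numerically. The sketch below assumes a hypothetical state-action pair whose transition lands in a good state or a very bad state with equal probability; all numbers are made up for illustration.

```python
import math

# Hypothetical next-state distribution for one (s_t, a_t) pair:
# a risky action that reaches a good state or a very bad state with equal odds.
next_state_probs = [0.5, 0.5]
next_state_values = [0.0, -10.0]   # V(s_{t+1}) for each outcome
reward = -1.0                      # r(s_t, a_t)

# Optimistic backup from exact inference: r + log E[exp(V)].
optimistic_q = reward + math.log(
    sum(p * math.exp(v) for p, v in zip(next_state_probs, next_state_values))
)

# Standard expected-value backup: r + E[V].
expected_q = reward + sum(p * v for p, v in zip(next_state_probs, next_state_values))

print(f"optimistic Q = {optimistic_q:.3f}, expected Q = {expected_q:.3f}")
```

The optimistic estimate stays close to the value of the good outcome and almost ignores the bad one, whereas the expected-value backup averages the two.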

This issue will be mitigated by variational inference discussed in the next part of this blog series.

Control As Inference In PGMs (Part 3) - Policy Search Via Variational Inference

2022-06-18T00:00:00+00:00

In the 3rd part of this blog, we will discuss another paradigm, where policy search is reframed as an optimization problem via approximate inference. We will see that this formulation allows us to make a distinction between controllable and non-controllable blocks in the graphical model and thus avoid the optimistic Bellman Update we obtained in part 2 of this blog series. In addition, we arrive at a max entropy RL objective which is critical for exploration and learning diverse skills.

Which objective does the inference procedure in Exact Inference in Part 2 solve?

The inference procedure discussed in part 2 of this blog series solves the following objective: $\color{red} \text{minimize} \quad D_{\mathrm{KL}}(\color{green} q_\phi(\tau)\color{red} \,\|\, \color{orange} p(\tau)\color{red}) = \color{red} \text{minimize} \quad D_{\mathrm{KL}}(\color{green} q_\phi(s_{1:T},a_{1:T}) \color{red} \,\|\, \color{orange} p(s_{1:T},a_{1:T},O_{1:T})\color{red})$

Here the joint distribution of optimal trajectories is given as follows:

\[\color{orange} p(\tau) = \color{black}\left[p\left(\mathbf{s}_{1}\right) \prod_{t=1}^{T} p\left(\mathbf{s}_{t+1} \mid \mathbf{s}_{t}, \mathbf{a}_{t}\right)\right] \exp \left(\sum_{t=1}^{T} r\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)\right)\]

Which variational distribution to choose?

Looking at the graphical model for the variational distribution, the joint distribution for $q(\tau)$ should be $q(\tau)=q\left(\mathbf{s}_{1}\right) \prod_{t=1}^{T} q\left(\mathbf{s}_{t+1} \mid \mathbf{s}_{t}, \mathbf{a}_{t}\right) \pi\left(\mathbf{a}_{t} \mid \mathbf{s}_{t}\right)$.

Here, unlike in the exact inference case, we make an explicit assumption about which part of the graphical model is controllable by the agent and which is not. It is reasonable to assume that the transition dynamics are not controllable by the agent, and hence we fix $q\left(\mathbf{s}_{1}\right)=p\left(\mathbf{s}_{1}\right) \text { and } q\left(\mathbf{s}_{t+1} \mid \mathbf{s}_{t}, \mathbf{a}_{t}\right)=p\left(\mathbf{s}_{t+1} \mid \mathbf{s}_{t}, \mathbf{a}_{t}\right)$.

Derivation Of Max Entropy RL Objective

It can be shown that minimizing this optimization objective results in the max entropy reinforcement learning objective, as derived below:

\[\begin{aligned} &\min D_{\mathrm{KL}} \left( \color{green}q(\tau) \color{black}\| \color{orange}p(\tau)\color{black}\right) =\max -E_{\color{green}q(\tau)} \left[\log \frac{\color{green}q(\tau)}{\color{orange}p(\tau)}\right] \\ &=\max E_{q(\tau)}\Big[\color{green} -\log p\left(s_{1}\right)-\sum_{t=1}^{T} \log p(s_{t+1}\mid s_{t},a_{t})-\sum_{t=1}^{T} \log \pi\left(a_{t} \mid s_{t}\right) \\ & \quad\quad\quad\color{orange}+ \log p\left(s_{1}\right) + \sum_{t=1}^{T} \log p\left(s_{t+1} \mid s_{t},a_{t}\right)+\sum_{t=1}^{T} \log p\left(O_{t} \mid s_{t}, a_{t}\right)\color{black}\Big] \\ &=\max E_{q(\tau)}\left[\color{orange} \sum_{t=1}^{T} \log \exp \left(r(s_{t}, a_{t})\right)\color{green}-\sum_{t=1}^{T} \log \pi\left(a_{t} \mid s_{t}\right)\color{black}\right] \\ &=\max \underbrace{E_{q(\tau)}\left[\sum_{t=1}^{T} r\left(s_{t}, a_{t}\right)\right]}_{\text{reward maximization}}+\underbrace{\sum_{t=1}^{T} E_{s_t \sim q(s_t)}\left[H\left(\pi\left(\cdot \mid s_{t}\right)\right)\right]}_{\text{conditional entropy maximization}}\end{aligned}\]

Deriving Soft Bellman Equations

We now look at message passing (backward messages) from an optimization point of view. To calculate the backward messages we start from the last time step.

At the last time step T

However, note that here we consider a general scenario where the reward can take any real value, $-\infty < r(s,a) < \infty$, as opposed to the earlier restriction to be negative or zero. Thus we need to normalize $\exp(r(s_T,a_T))$, using the normalizing constant $\exp(V(s_T))=\int_{\mathcal{A}}\exp(r(s_T,a_T))\, da_T$.

Thus we do a little bit more algebraic manipulation to include this normalization constant as follows:
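One way to carry out this manipulation is sketched below. It assumes the per-step objective at $T$ is the expected reward plus the policy entropy, and writes $V(\mathbf{s}_T)=\log \int_{\mathcal{A}} \exp(r(\mathbf{s}_T,\mathbf{a}_T))\, d\mathbf{a}_T$ for the log-normalizer:

\[\begin{aligned} E_{\mathbf{a}_{T} \sim \pi}\left[r\left(\mathbf{s}_{T}, \mathbf{a}_{T}\right)-\log \pi\left(\mathbf{a}_{T} \mid \mathbf{s}_{T}\right)\right] &= E_{\mathbf{a}_{T} \sim \pi}\left[\log \frac{\exp \left(r\left(\mathbf{s}_{T}, \mathbf{a}_{T}\right)\right)}{\exp \left(V\left(\mathbf{s}_{T}\right)\right)}-\log \pi\left(\mathbf{a}_{T} \mid \mathbf{s}_{T}\right)\right]+V\left(\mathbf{s}_{T}\right) \\ &= -D_{\mathrm{KL}}\left(\pi\left(\mathbf{a}_{T} \mid \mathbf{s}_{T}\right) \,\|\, \exp \left(r\left(\mathbf{s}_{T}, \mathbf{a}_{T}\right)-V\left(\mathbf{s}_{T}\right)\right)\right)+V\left(\mathbf{s}_{T}\right) \end{aligned}\]

Since the KL term is non-negative and vanishes only when the two distributions match, the expression is maximized by choosing $\pi$ equal to the normalized exponentiated reward.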

The optimal policy that minimizes this objective is given as :

\[\begin{aligned} \color{green}\pi^*\left(\mathbf{a}_{T} \mid \mathbf{s}_{T}\right)&=\exp \left(r\left(\mathbf{s}_{T}, \mathbf{a}_{T}\right)-V\left(\mathbf{s}_{T}\right)\right)\\ \color{green}V\left(\mathbf{s}_{T}\right)&=\log \int_{\mathcal{A}} \exp \left(r\left(\mathbf{s}_{T}, \mathbf{a}_{T}\right)\right) d \mathbf{a}_{T}\\ &\approx\underset{\mathcal{A}}{\operatorname{softmax}}\left(r\left(\mathbf{s}_{T}, \mathbf{a}_{T}\right)\right) \end{aligned}\]
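The closed form can be checked numerically. The sketch below assumes a hypothetical final state with three discrete actions (so the integral becomes a sum) and made-up rewards; it compares the max entropy objective of $\pi^*$ against a greedy and a uniform policy.

```python
import math

# Hypothetical rewards r(s_T, a) for three discrete actions in one final state.
rewards = [1.0, 0.0, -2.0]

# Soft value: V = log sum_a exp(r(a)) (the integral becomes a sum).
soft_value = math.log(sum(math.exp(r) for r in rewards))

# Optimal max-entropy policy: pi*(a) = exp(r(a) - V).
policy = [math.exp(r - soft_value) for r in rewards]

def maxent_objective(pi, r):
    """Expected reward plus entropy, E_pi[r] + H(pi)."""
    return sum(p * (ri - math.log(p)) for p, ri in zip(pi, r) if p > 0)

greedy = [1.0, 0.0, 0.0]          # hard-max policy for comparison
uniform = [1 / 3, 1 / 3, 1 / 3]

print(sum(policy), maxent_objective(policy, rewards), soft_value)
print(maxent_objective(greedy, rewards), maxent_objective(uniform, rewards))
```

The policy sums to one, its objective value equals $V(\mathbf{s}_T)$, and both comparison policies score lower.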

At any time step t,

The optimal policy that minimizes this objective at any time step t is given as :

\[\begin{aligned} \color{green}\pi^*\left(\mathbf{a}_{t} \mid \mathbf{s}_{t}\right)&=\exp \left(Q\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)-V\left(\mathbf{s}_{t}\right)\right)\\ \color{green}Q\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)&=r\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)+E_{\mathbf{s}_{t+1} \sim p\left(\mathbf{s}_{t+1} \mid \mathbf{s}_{t}, \mathbf{a}_{t}\right)}\left[V\left(\mathbf{s}_{t+1}\right)\right]\\ \color{green}V\left(\mathbf{s}_{t}\right)&=\log \int_{\mathcal{A}} \exp \left(Q\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)\right) d \mathbf{a}_{t}\\ &\approx\underset{\mathcal{A}}{\operatorname{softmax}}\left(Q\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)\right) \end{aligned}\]

This means that, if we fix the dynamics and initial state distribution, and only allow the policy to change, we recover a Bellman backup operator that uses the expected value of the next state, rather than the optimistic estimate we saw in part 2 of the blog series. Thus we avoid the risk-seeking behaviour / optimistic Bellman backups via the control as inference framework.
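The soft Bellman equations can be run as a backward recursion on a small tabular problem. The sketch below assumes a hypothetical 2-state, 2-action MDP with made-up transitions and rewards, and a value of zero beyond the horizon; it is an illustration, not the implementation used in any particular algorithm.

```python
import math

# Hypothetical 2-state, 2-action MDP (all numbers are illustrative).
# transition[s][a] is a list of probabilities over next states.
transition = [
    [[1.0, 0.0], [0.5, 0.5]],
    [[0.0, 1.0], [0.5, 0.5]],
]
reward = [[-1.0, -0.5], [0.0, -0.5]]  # reward[s][a]
horizon = 5
n_states, n_actions = 2, 2

def soft_backup(next_value):
    """One soft Bellman backup: Q = r + E[V'], V = log sum_a exp(Q)."""
    q = [
        [
            reward[s][a]
            + sum(transition[s][a][s2] * next_value[s2] for s2 in range(n_states))
            for a in range(n_actions)
        ]
        for s in range(n_states)
    ]
    v = [math.log(sum(math.exp(qa) for qa in q[s])) for s in range(n_states)]
    return q, v

value = [0.0] * n_states  # assumed value beyond the horizon
for _ in range(horizon):
    q_values, value = soft_backup(value)

# Policy at the first time step: pi(a | s) = exp(Q(s, a) - V(s)).
policy = [[math.exp(q_values[s][a] - value[s]) for a in range(n_actions)]
          for s in range(n_states)]
print(policy)
```

Note that the next-state term is a plain expectation, while the soft maximum appears only over actions, which is the part the agent controls.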

We will discuss how this framework is used practically in modern Deep RL algorithms in the next part of this blog series.

Meta Learning and MAML (Model-Agnostic Meta-Learning)

2018-08-15T00:00:00+00:00

This blogpost will talk about Meta Learning and a very intuitive Meta Learning algorithm, namely Model-Agnostic Meta-Learning (MAML), and a couple of its variants.

Meta Learning (“Learning To Learn”)

Many problems of interest require rapid inference from small quantities of data. We should adopt learning strategies in which a single observation or a few observations can result in abrupt shifts in behavior (single/few-shot learning). In conventional gradient-based models, when new data is encountered the models must inefficiently relearn their parameters to adequately incorporate the new information without catastrophic interference.

For example, a robot designed to clean nuclear waste encounters a wide variety of tasks with many objects. It should be able to amortize its experience from previously learnt skills and improve data efficiency in acquiring new skills, rather than learning each skill from scratch.

Another example would be a malware vs. clean classification task, where a classification engine has to adapt to the new malware variants being released each day. You would typically have only a few examples per class for a new day (here the new task is classifying between malware and clean files encountered on that particular day), which calls for meta-learning for “few-shot classification”. (This would be an interesting application of meta learning to try out.)

Inspired by nature? This kind of flexible adaptation is a celebrated aspect of human learning (Jankowski et al., 2011), manifesting in settings ranging from motor control (Braun et al., 2009) to the acquisition of abstract concepts (Lake et al., 2015). Generating novel behavior based on inference from a few scraps of information, e.g., inferring the full range of applicability for a new word heard in only one or two contexts, is something that has remained stubbornly beyond the reach of contemporary machine intelligence.

Why not Deep Learning? In situations when only a few training examples are presented one-by-one, a straightforward gradient-based solution such as a deep neural network is to completely re-learn the parameters from the data available at the moment. Such a strategy is prone to poor learning and/or catastrophic interference. In view of these hazards, non-parametric methods are often considered to be better suited.

General Meta Learning Framework

Meta-learning generally refers to a scenario in which an agent learns at two levels, each associated with different time scales. Rapid learning/task-specific learning occurs within a task, for example, when learning to accurately classify within a particular dataset. This learning is guided by knowledge accrued more gradually across tasks, which captures the way in which task structure varies across target domains. Given its two-tiered organization, this form of meta learning is often described as “learning to learn”. This task-agnostic learning helps in quick adaptation to new tasks.

This means that during meta-learning we need to provide the learning algorithm with information (an encoding) about the task/context in some form, along with the training samples associated with each task. So how do we feed data to a meta-learner?

During meta-learning we provide a support/context $D_{support}$ along with the test sample.

What is $D_{support}$?: $D_{support}$ can be defined as a set of $(x_i,y_i)$ tuples where $x_i$ is the input/observation and $y_i$ is the output/reward. $D_{support}$ provides a context (information about the task to be performed) with respect to the test sample. $D_{support}$ usually has only a few samples (few-shot) per class.
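As an illustration of how such a context could be assembled, the sketch below samples one N-way K-shot episode from a labelled dataset. The function `sample_episode`, its parameters and the toy dataset are hypothetical names introduced here, not part of any library.

```python
import random

def sample_episode(data_by_class, n_way, k_shot, n_query, rng):
    """Sample one N-way K-shot task: a support set and a query (test) set."""
    classes = rng.sample(sorted(data_by_class), n_way)
    support, query = [], []
    for label in classes:
        examples = rng.sample(data_by_class[label], k_shot + n_query)
        support += [(x, label) for x in examples[:k_shot]]
        query += [(x, label) for x in examples[k_shot:]]
    return support, query

# Hypothetical toy dataset: 5 classes with 10 scalar "observations" each.
toy_data = {c: [c + 0.1 * i for i in range(10)] for c in range(5)}
support_set, query_set = sample_episode(toy_data, n_way=3, k_shot=2, n_query=4,
                                        rng=random.Random(0))
print(len(support_set), len(query_set))
```

The support set plays the role of $D_{support}$, and the query set holds the test samples that are evaluated in that context.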

Model-Agnostic Meta-Learning

The two key features of MAML are,

1) It makes gradient-based solutions good at “few-shot learning”.
2) It is model-agnostic in the sense that it can be applied to any learning algorithm/model (classification, regression and reinforcement learning) that uses gradient-descent-based optimization.

Note: There are recent papers that extend MAML to unsupervised and semi-supervised tasks.

As discussed earlier this meta learning procedure has 2 stages.

In the task specific update stage, corresponding to each task the model parameters are updated using the task specific dataset.

Update 1 \[ \theta_i^{*}=\theta - \alpha\times \nabla_\theta L_i(\theta,y_i) \]

Next is the meta update stage, where we try to find a generalist structure among these tasks such that the updated parameter is always close to each of the task-specific optimal parameters, $\theta_i^*$ (see Figure 1). In effect the parameter gets updated in the meta update in such a manner that it is one (or a few) steps away from doing well at each one of the tasks. This is achieved by optimizing the performance of $f_{\theta_i^*}$ with respect to $\theta$ across the tasks as follows. Here $L_i$ is a loss to be minimized, so both updates step against the gradient; for a reward objective the signs flip.

Update 2 \[ \begin{aligned} \theta = & \theta - \beta\times\nabla_\theta\sum_i^NL_i(f_{\color{red}{\theta_i^*}},y_i) \\ = &\theta - \beta \nabla_\theta \sum_i^NL_i(f_{\color{red}{\theta - \alpha\times \nabla_\theta L_i(\theta,y_i)}},y_i) \end{aligned} \]

Another way of looking at MAML: The meta optimization is performed over the model parameters $\theta$ but the objective is computed using the updated models $f_{\theta_i^*}$. Thus we update $\theta$ such that it learns to get a low classification error/high reward in the next step rather than in this step. This makes the model parameter $\theta$ more sensitive to changes in tasks, such that small changes in the parameter can produce large improvements on the task it is tuned for.

Does MAML fall within the general meta learning framework?: Yes, it does. The task-specific learning happens in Update 1 and meta-learning happens in Update 2.

How to train? “You get good at what you practice”. Thus the training-time protocol used is the same as the testing-time protocol, as in O. Vinyals (2016). Split the data/demonstrations for each individual task into a training and validation pair. Use one of these for Update 1 (task-specific learning) and the other for Update 2 (meta-learning).
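The two updates and the train/validation split can be sketched on a toy problem where the gradients are available in closed form. The example below assumes a one-parameter model $f_\theta(x)=\theta x$, a mean squared error loss minimized by gradient descent, and three hypothetical regression tasks; the helper names are made up for illustration and this is not the reference MAML implementation.

```python
def loss_grad(theta, xs, ys):
    """Gradient of the mean squared error of the model f(x) = theta * x."""
    return sum(2 * (theta * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def loss_hessian(xs):
    """Second derivative of the same loss with respect to theta."""
    return sum(2 * x * x for x in xs) / len(xs)

def maml_step(theta, tasks, alpha, beta):
    """One meta update over a batch of (support, query) task splits."""
    meta_grad = 0.0
    for (xs_s, ys_s), (xs_q, ys_q) in tasks:
        # Update 1: task-specific step on the support (training) split.
        theta_i = theta - alpha * loss_grad(theta, xs_s, ys_s)
        # Update 2: gradient of the query loss at theta_i w.r.t. theta.
        # Chain rule: d theta_i / d theta = 1 - alpha * Hessian(support loss).
        meta_grad += loss_grad(theta_i, xs_q, ys_q) * (1 - alpha * loss_hessian(xs_s))
    return theta - beta * meta_grad

# Hypothetical tasks: regress y = slope * x for slopes 1, 2 and 3.
support_x, query_x = [-1.0, 0.5, 1.0], [-0.5, 1.5]
tasks = [
    (
        (support_x, [slope * x for x in support_x]),
        (query_x, [slope * x for x in query_x]),
    )
    for slope in (1.0, 2.0, 3.0)
]

theta = 0.0
for _ in range(200):
    theta = maml_step(theta, tasks, alpha=0.1, beta=0.05)
print(theta)
```

In this symmetric toy setting the meta-learned $\theta$ settles at the average slope, the point from which a single inner step does equally well on every task. With a neural network the chain-rule factor would be computed by automatic differentiation rather than by hand.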

An Illustration

Unsupervised Extension

Refer to One-Shot Visual Imitation Learning via Meta-Learning and Semi-Supervised Few-Shot Learning with MAML. I will write on these shortly.

References

C. Finn et al. 2017 Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks
C. Finn et al. 2017 One-Shot Visual Imitation Learning via Meta-Learning
Semi-Supervised Few-Shot Learning with MAML
O. Vinyals, 2016, Matching Networks for One Shot Learning

Kernel Methods in Machine Learning

2018-07-07T00:00:00+00:00

This blog will talk about one of the most theoretically sound Machine Learning techniques, Kernel Methods, which became popular in the 1990s along with their best known member, the Support Vector Machine.

In kernel theory we assume that learning happens in an RKHS (a nice space of functions for non-parametric statistics and machine learning), and the theorem that forms the backbone for learning in an RKHS is the Representer Theorem.

Before scaring you guys with RKHS and theorems right from the beginning, let me explain two main properties of Kernel methods.

Property 1: Kernel Methods can be thought of as instance-based learners: rather than learning some fixed set of parameters corresponding to the features of their inputs, they instead “remember” the $i$-th training example $\mathbf{(x_i,y_i)}$ and learn for it a corresponding weight $w_{i}$. This basically means that functions (hyperplanes in SVM, basis functions in KPCA) learnt through Kernel Methods can be represented as a weighted linear combination of kernel functions centred at the training points, and what the algorithm actually “learns” are the weights corresponding to each point. The Representer Theorem provides the explanation for this. As per the theorem, the regularized risk functional (basically the objective function of the optimization problem being solved) of any algorithm which is a member of the Kernel Methods family takes the following general form:

\[\hspace{1cm} \frac{1}{N}\sum_{i=1}^{N}C(y_i,f(x_i)) + \frac{\lambda}{2}\Omega(f)\]

$\hspace{1cm} \Omega-\text{any monotonically increasing function}$ $\hspace{1cm} C - \text{cost function}$ $\hspace{1cm} f - \text{function that we intend to learn}$

Property 2: Now comes the “Kernel Trick”. Kernel Methods, through kernel functions, allow you to perform the learning as if the data were projected to a higher dimensional space, while operating in the original space. You can kernelize an algorithm by reformulating it in such a manner that the computations depend only on the inner products of the data points rather than on the actual data points; these inner products can then be replaced by a kernel function (following the property of RKHS).

\[\hspace{2cm}k(x,y) = \langle \phi(x),\phi(y) \rangle \hspace{1cm}\]

“Why is it nice?”: Say you have a set of images of 32x32=1024 pixels, and the data is not linearly separable. The option you have is to project it onto a higher dimensional space, let’s say quadratic, $\mathcal{R}^{1024} \rightarrow \mathcal{R}^{1024 \times 1024}$. Data manipulation in this space is highly expensive. Kernel functions allow us to do the computations in the lower dimensional space while effectively learning in the higher dimensional space.
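A small sketch makes the saving visible. It assumes the homogeneous quadratic kernel $k(x,y)=\langle x,y\rangle^2$, whose explicit feature map consists of all pairwise products $x_i x_j$; the function names and vectors are illustrative.

```python
from itertools import product

def quadratic_features(x):
    """Explicit feature map: all pairwise products x_i * x_j (d^2 features)."""
    return [xi * xj for xi, xj in product(x, repeat=2)]

def dot(u, v):
    """Plain inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def quadratic_kernel(x, y):
    """Kernel k(x, y) = <x, y>^2, computed in the original d-dimensional space."""
    return dot(x, y) ** 2

x = [1.0, 2.0, 3.0]
y = [0.5, -1.0, 2.0]
explicit = dot(quadratic_features(x), quadratic_features(y))
implicit = quadratic_kernel(x, y)
print(explicit, implicit)
```

Both routes give the same number, but the explicit map needs $d^2$ features per point (about a million for the 1024-pixel images above), while the kernel only needs one $d$-dimensional inner product.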

Few Algorithms

Support Vector Machines
Kernel Ridge Regression
Kernel PCA

Appendix

Reproducing Kernel Hilbert Space(RKHS): is a subspace of the Hilbert space with respect to kernel $k: \mathcal{X}\times\mathcal{X} \rightarrow \mathcal{R}$ constructed in the following manner.

Let $\mathcal{H'} = \{ k(\cdot,x): x \in \mathcal{X} \}$ be the set of kernel functions.
We construct a vector space $\mathcal{H}$ using the linear combinations of all kernel functions in the set $\mathcal{H'}$, so any function $f \in \mathcal{H}$ can be represented as a linear combination of a subset of these kernel functions as $f = \sum_{i=1}^n \alpha_i k(\cdot,x_i)$ for some $n$, $x_i \in \mathcal{X}$ and $\alpha_i \in \mathcal{R}$.
The inner product of $f, g \in \mathcal{H}$, with $g = \sum_{j=1}^{n'} \beta_j k(\cdot,x'_j)$, has the following definition: $\langle f,g \rangle = \sum_{i=1}^n \sum_{j=1}^{n'} \alpha_i \beta_j k(x_i,x'_j)$. (It can be proven that this inner product is well defined.)
This particular definition gives rise to the reproducing property of the Hilbert space: $\langle f,k(\cdot,x) \rangle = \sum_{i=1}^n \alpha_i k(x,x_i) = f(x)$. This shows that the kernel is a representer of evaluation (the evaluation functional), analogous to Dirac delta functions.
What is its significance?: It enables the Kernel Trick. While learning in an RKHS, inner products in our computations can be replaced by kernels, $\langle \phi(x),\phi(y) \rangle = \langle k(\cdot,x),k(\cdot,y) \rangle=k(x,y)$, where $\phi$ maps $x \in \mathcal{X}$ to a (possibly infinite-dimensional) space, $\phi : x \rightarrow k(\cdot,x) \in \mathcal{H}$. This is particularly useful in cases where data is not linearly separable in $ \mathcal{X} $, where transformations to higher dimensional spaces are necessary; the kernel trick spares us from making these transformations explicitly.

Representer Theorem: states that a minimizer $f^{*}$ of a regularized empirical risk functional defined over a Reproducing Kernel Hilbert Space can be represented as a finite linear combination of kernel functions evaluated at the input points in the training set.

Precise Definition: Let $\Omega : [0,\infty) \rightarrow \mathcal{R}$ be a strictly monotonically increasing function (applied to the RKHS norm $\|f\|$, written $\Omega(f)$ for short), $\mathcal{X}$ a set, and $C : (\mathcal{X} \times \mathcal{R}^2)^N \rightarrow \mathcal{R} \cup \{\infty\}$ an arbitrary loss function. Then any $f \in \mathcal{F}$, with $\mathcal{F}$ the RKHS, minimizing the regularized risk functional

$\frac{1}{N}\sum_{i=1}^{N}C(y_i,f(x_i)) + \frac{\lambda}{2}\Omega(f)$ admits a representation of the form $f(\cdot) = \sum_{i=1}^{N}\alpha_i k(\cdot,x_i)$.
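Kernel ridge regression is a convenient instance of this statement. The sketch below assumes a squared loss, the regularizer $\frac{\lambda}{2}\|f\|^2$, a Gaussian kernel on scalars and a made-up training set; under those assumptions the coefficients solve $(K + \frac{\lambda N}{2} I)\alpha = y$. The small `solve` helper is written out only to keep the example dependency-free.

```python
import math

def rbf_kernel(a, b, gamma=1.0):
    """Gaussian (RBF) kernel on scalars."""
    return math.exp(-gamma * (a - b) ** 2)

def solve(matrix, rhs):
    """Solve a small dense linear system by Gaussian elimination with pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    solution = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][c] * solution[c] for c in range(r + 1, n))
        solution[r] = (aug[r][n] - tail) / aug[r][r]
    return solution

# Hypothetical 1-D training set.
train_x = [-2.0, -1.0, 0.0, 1.0, 2.0]
train_y = [math.sin(x) for x in train_x]
lam, n = 0.01, len(train_x)

# Squared loss with regularizer (lam / 2) * ||f||^2 gives
# (K + (lam * n / 2) I) alpha = y for the coefficients alpha.
gram = [[rbf_kernel(a, b) for b in train_x] for a in train_x]
system = [[gram[i][j] + (lam * n / 2 if i == j else 0.0) for j in range(n)]
          for i in range(n)]
alpha = solve(system, train_y)

def predict(x):
    """f(x) = sum_i alpha_i * k(x, x_i): a finite kernel expansion."""
    return sum(a * rbf_kernel(x, xi) for a, xi in zip(alpha, train_x))

print([round(predict(x) - y, 3) for x, y in zip(train_x, train_y)])
```

Although the RKHS of the Gaussian kernel is infinite-dimensional, the whole fit reduces to solving for $N$ coefficients, one per training point.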

What is its significance?: Firstly, it gives a generic definition of the optimization objectives that come under the umbrella of kernel methods. Secondly, representer theorems [Smola et al. 2011] are useful from a practical standpoint because they dramatically simplify the regularized empirical risk minimization problem. In most interesting applications, the search domain $H_{k}$ for the minimization will be an infinite-dimensional subspace of $L^{2}({\mathcal {X}})$, and therefore the search (as written) does not admit implementation on finite-memory and finite-precision computers. In contrast, the representation of $f^{*}(\cdot )$ afforded by a representer theorem reduces the original (infinite-dimensional) minimization problem to a search for the optimal $n$-dimensional vector of coefficients $\alpha =( \alpha _{1},...,\alpha _{n})\in \mathbb {R} ^{n}$; $\alpha$ can then be obtained by applying any standard function minimization algorithm. Consequently, representer theorems provide the theoretical basis for the reduction of the general machine learning problem to algorithms that can actually be implemented on computers in practice.

Disclaimer

Please feel free to contact me at vaisakhs.shaj@gmail.com if you find any errors or have suggestions; I will correct them promptly. You may also comment below.

DDPG: Deep Deterministic Policy Gradients

2018-06-30T00:00:00+00:00

This blogpost will talk about Deep Deterministic Policy Gradients.

While the popular DQN solves problems with high-dimensional observation spaces, it can only handle discrete and low-dimensional action spaces. Many tasks of interest, most notably physical control tasks, have continuous (real valued) and high dimensional action spaces.

To extend DQN to continuous control tasks, we need to perform an extra, expensive optimization procedure in the inner loop of training (during the Bellman update stage), as shown below:

\[Q(s_t,a_t) = r(s_t,a_t) + \gamma \times Q(s_{t+1},\mathbin{\color{red}{\underset{a'}{\arg\max}\,Q(s_{t+1},a')}})\]

One may use SGD to solve this, but it turns out to be quite slow. Other solutions include sampling actions, CEM, CMA-ES etc., which often work for action dimensions up to 40. Another option is to use easily maximizable Q functions, where you gain computational efficiency and simplicity but lose representational power. DDPG is the third approach, and the most widely used because of its simplicity of implementation and its similarity to well-known topics like Q Learning and Policy Gradients.

DDPG can be interpreted in terms of both the Q Learning and the Policy Gradients literature. In terms of Q Learning, it uses a function approximator to solve the max in the Bellman equation of Q Learning (an approximate Q Learning method).

This is achieved by learning a deterministic policy $\mu_{\theta}(s)$ using an actor neural net and updating the policy parameters by moving them in the direction of the gradient of the action-value function.

\begin{equation} \begin{aligned} \theta :=\quad& \theta\quad + \quad \alpha\times\underset{j}{\sum}\nabla_\theta Q(s_j,\mu_\theta(s_j)) \\ =\quad & \theta \quad + \quad \alpha\times\underset{j}{\sum}\nabla_{\mu_\theta} Q(s_j,\mu_\theta(s_j)) \times \nabla_{\theta}\mu_\theta(s_j) \\ =\quad & \theta \quad+\quad\alpha\times\underset{j}{\sum}\nabla_{a} Q(s_j,a)\big|_{a=\mu_\theta(s_j)}\times\nabla_{\theta}\mu_\theta(s_j) \end{aligned} \end{equation}

The above update rule is a special limiting case of the Policy Gradient theorem and its convergence to a locally optimal policy is proven. This is the policy gradient explanation of the algorithm and hence the name DD-Policy Gradient(For more details refer [2]).
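The actor update can be traced by hand on a toy problem. The sketch below assumes a hypothetical, fixed critic $Q(s,a) = -(a-2s)^2$ and a linear actor $\mu_\theta(s)=\theta s$, so both gradients are available in closed form; in DDPG proper both would be neural networks, and the critic would be learnt alongside the actor.

```python
# Hypothetical critic: Q(s, a) = -(a - 2 s)^2, so the best action is a = 2 s.
def q_grad_action(state, action):
    """dQ/da for the toy critic above."""
    return -2.0 * (action - 2.0 * state)

# Hypothetical deterministic actor: mu_theta(s) = theta * s.
def actor(theta, state):
    return theta * state

def actor_grad_theta(state):
    """d mu_theta(s) / d theta for the linear actor."""
    return state

states = [-1.0, 0.5, 2.0]   # a fixed mini-batch of states s_j
theta, alpha = 0.0, 0.05

for _ in range(200):
    # theta <- theta + alpha * sum_j dQ/da |_{a = mu(s_j)} * d mu / d theta
    step = sum(q_grad_action(s, actor(theta, s)) * actor_grad_theta(s) for s in states)
    theta += alpha * step

print(theta)
```

The actor parameter moves towards the value at which $\mu_\theta(s)$ matches the critic's maximizing action, without ever solving the inner maximization over actions explicitly.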

You can find the implementation of it on a HalfCheetah Environment using OpenAI-BASELINES here.

References

Sutton et al. 1999 Policy Gradient Methods for RL with Function Approximation
Silver et al. 2014 Deterministic Policy Gradient Algorithms
DDPG Blog