Introduction and motivation for the control-as-inference paradigm
Probabilistic Graphical Models (PGMs) are powerful tools for representing relationships between random variables using graphs (directed or undirected). Once we come up with a representation of the stochastic phenomenon we wish to model, PGMs provide a consistent and flexible framework to devise principled objectives, set up models that reflect the causal structure in the world, and allow a common set of inference methods to be deployed against a broad range of problem domains.
This blog summarizes how Reinforcement Learning can be brought under the PGM framework, which allows us to recast policy search from an optimization problem into an inference problem. We will also see how this framework lets us recover “soft” versions of the Bellman backup equations from classical RL.
Human or animal behaviour is often not perfectly optimal but approximately optimal. For example, an animal or human whose only goal is to move from the start position to the goal position, as shown in figure 1, can choose either the perfectly optimal/hard-optimal (green) trajectory or a suboptimal (blue) trajectory, but would typically avoid the red trajectories, which represent bad behaviour.
We need a probabilistic framework for this stochastic phenomenon of goal-directed behaviour. The framework, in addition to giving a higher probability to perfectly optimal behaviour, should also give a non-zero probability to suboptimal behaviour. Similarly, a near-zero probability should be assigned to bad behaviour that misses reaching the goal.
In machine learning whenever we have a stochastic phenomenon, we usually come up with a probabilistic graphical model (PGM) based on the observed stochastic phenomenon, such that the samples from this model will look like the observed stochastic phenomenon.
Thus we need a PGM that models optimal decision-making. Let’s first construct a PGM that captures the relationship between states, actions and next states, as in figure 2.
This PGM in figure 2, commonly known as a state space model (SSM), can represent physically consistent trajectories, but for control/decision making we need a notion of cost/reward/optimality. Thus we modify the PGM by introducing additional optimality variables, defined by $p(\mathcal{O}_t=1 \mid s_t,a_t) = \exp(r(s_t,a_t))$.
Thus $\mathcal{O}_t$ is a binary random variable indicating whether the behaviour at time $t$ was optimal. We define this probability as the exponential of the reward function to make a convenient distinction between optimal behaviour (the best way to reach the goal), suboptimal behaviour (reaches the goal, but not optimally) and bad behaviour (fails to reach the goal). This will become more apparent in the next section.
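As a quick sanity check of this definition, here is a small numerical sketch (a hypothetical example, not from the original post). It assumes, for illustration, that rewards are non-positive so that $\exp(r(s_t,a_t)) \le 1$ is a valid probability, and compares the resulting per-trajectory optimality probabilities for the three kinds of behaviour in figure 1.

```python
# Hypothetical per-step rewards for the three trajectories of figure 1:
# green (optimal), blue (suboptimal), red (bad / misses the goal).
# Rewards are assumed non-positive so that exp(r) is a valid probability.
import numpy as np

rewards = {
    "green (optimal)":   np.array([-0.1, -0.1, -0.1]),
    "blue (suboptimal)": np.array([-0.5, -0.4, -0.6]),
    "red (bad)":         np.array([-5.0, -6.0, -4.0]),
}

for name, r in rewards.items():
    step_probs = np.exp(r)           # p(O_t = 1 | s_t, a_t) = exp(r(s_t, a_t))
    traj_prob = np.prod(step_probs)  # = exp(sum_t r(s_t, a_t))
    print(f"{name:19s} p(O_1:T = 1) = {traj_prob:.6f}")
```

The optimal trajectory gets the highest probability, the suboptimal one a smaller but clearly non-zero probability, and the bad trajectory a probability close to zero, which is exactly the behaviour we asked for above.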
Let’s derive a mathematical expression for the probability of a trajectory (state action sequence) given the observed optimality variables. We will see that based on the PGM we can derive an expression where optimal/sub-optimal trajectories can be conveniently represented.
\[\begin{aligned}p\left(\tau \mid\mathcal{O}_{1: T}\right) & = \frac{p\left(\tau, \mathcal{O}_{1: T}\right)}{p\left( \mathcal{O}_{1: T}\right)} \\ & \propto p\left(\tau, \mathcal{O}_{1: T}\right)\\ & = p\left(s_{1}\right) \prod_{t=1}^{T} p\left(a_{t} \mid s_{t}\right) p\left(s_{t+1} \mid s_{t}, a_{t}\right) p\left(\mathcal{O}_{t} \mid s_{t}, a_{t}\right) \\ &=p\left(s_{1}\right) \prod_{t=1}^{T} p\left(s_{t+1} \mid s_{t}, a_{t}\right)p\left(a_{t} \mid s_{t}\right) \exp \left(r\left(s_{t}, a_{t}\right) \right) \\ &=\left[\underbrace{p\left(s_{1}\right) \prod_{t=1}^{T} p\left(s_{t+1} \mid s_{t}, a_{t}\right)p\left(a_{t} \mid s_{t}\right)}_\text{physically consistent dynamics with action prior}\right] \underbrace{\exp \left(\sum_{t=1}^{T} r\left(s_{t}, a_{t}\right) \right)}_\text{exponential of sum of rewards}\\ &=\left[\underbrace{p\left(s_{1}\right) \prod_{t=1}^{T} p\left(s_{t+1} \mid s_{t}, a_{t}\right)}_\text{physically consistent dynamics}\right] \underbrace{\exp \left(\sum_{t=1}^{T} r\left(s_{t}, a_{t}\right) + \log p\left(a_{t} \mid s_{t}\right) \right)}_\text{exponential of sum of modified rewards} \end{aligned}\]
As we can see, trajectories that are physically consistent and optimal (in terms of long-term reward) receive the highest probability mass. Additionally, a suboptimal trajectory with a slightly lower total reward can also be modelled and sampled under this graphical-model framework, which is important in several settings including inverse reinforcement learning.
Note: One could ignore the action prior or assume a uniform action prior to further simplify the equations.
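To make this factorization concrete, here is a minimal sketch (all names and the randomly generated tabular model are hypothetical) that evaluates the unnormalized trajectory probability as the product of the physically consistent dynamics terms and the exponentiated sum of rewards, using a uniform action prior as in the note above.

```python
import numpy as np

# Hypothetical tabular model: P[s, a, s'] = p(s' | s, a), R[s, a] = r(s, a) <= 0.
n_states, n_actions = 3, 2
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = -rng.uniform(0.0, 1.0, size=(n_states, n_actions))
p_s1 = np.ones(n_states) / n_states   # uniform initial-state distribution
p_a = 1.0 / n_actions                 # uniform action prior

def unnormalized_traj_prob(states, actions):
    """p(s_1) prod_t p(a_t|s_t) p(s_{t+1}|s_t,a_t) * exp(sum_t r(s_t,a_t))."""
    dynamics = p_s1[states[0]]
    total_reward = 0.0
    for t, a in enumerate(actions):
        s, s_next = states[t], states[t + 1]
        dynamics *= p_a * P[s, a, s_next]
        total_reward += R[s, a]
    return dynamics * np.exp(total_reward)

# Two hypothetical trajectories of length T = 3 (the states list has length T + 1).
print(unnormalized_traj_prob([0, 1, 2, 0], [1, 0, 1]))
print(unnormalized_traj_prob([0, 0, 1, 2], [0, 0, 1]))
```

Between two trajectories that are equally likely under the dynamics, the one with the higher total reward is exponentially more probable, matching the intuition behind figure 1.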
2. Allows us to use established inference schemes to answer several queries, including:
Policy Search: Given a reward function, infer the optimal policy by computing $p(a_t \mid s_t, \mathcal{O}_{t:T})$. Instead of solving an optimization problem, we can now solve an inference problem. This will be discussed in detail in part 2 of this blog, and an approximate inference scheme based on a variational/optimization-based formulation is discussed in part 3. A toy tabular sketch of this inference query is given after this list.
Inverse Reinforcement Learning: Given a collection of optimal trajectories, infer the reward function and action prior.
\[\begin{aligned} p\left(\tau, \mathcal{O}_{1: T}, \theta, \phi\right) & \propto\left[p\left(s_{1}\right) \prod_{t=1}^{T} p\left(s_{t+1} \mid s_{t}, a_{t}\right)\right] \times \\ &\quad\quad\quad\quad\quad\quad\exp \left(\sum_{t=1}^{T} r_{\phi}\left(s_{t}, a_{t}\right)+\log p_{\theta}\left(a_{t} \mid s_{t}\right)\right) \\&=\left[p\left(s_{1}\right) \prod_{t=1}^{T} p\left(s_{t+1} \mid s_{t}, a_{t}\right)\right]\times \\ &\quad\quad\quad\quad\quad\quad\exp\left(\sum_{t=1}^{T} \phi^{T} f_{r}\left(s_{t}, a_{t}\right)+\log \theta^{T} f_{p}\left(a_{t} \mid s_{t}\right)\right) \end{aligned}\]
Here the last line assumes a linear parameterization of the reward and the action prior in terms of features $f_r$ and $f_p$.
3. Allows us to model stochastic behaviour, which has several advantages:
Transfer Learning: If we can model multiple ways of solving a particular task, this turns out to be useful for transfer learning in a new setting where the task has to be solved in a slightly different manner.
Better Exploration Strategies: We will see that the maximum entropy objective derived in part 3 of this blog series, on Policy Search as Variational Inference, provides a natural exploration strategy based on entropy maximization.
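To make the policy-search query in item 2 above concrete (and to preview the “soft” backups promised in the introduction), here is a minimal tabular sketch, again with a hypothetical random model and a uniform action prior. It computes the backward messages $\beta_t(s_t, a_t) = p(\mathcal{O}_{t:T} \mid s_t, a_t)$ and reads off the policy $p(a_t \mid s_t, \mathcal{O}_{t:T}) \propto \beta_t(s_t, a_t)$; the full derivation is left to part 2 of this blog.

```python
import numpy as np

# Hypothetical tabular model (same flavour as the earlier sketch).
n_states, n_actions, T = 3, 2, 5
rng = np.random.default_rng(1)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # p(s' | s, a)
R = -rng.uniform(0.0, 1.0, size=(n_states, n_actions))            # r(s, a) <= 0

# Backward messages: beta_T(s, a) = exp(r(s, a)), then recurse backwards in time.
beta_sa = np.exp(R)
for t in reversed(range(T - 1)):
    beta_s = beta_sa.mean(axis=1)       # beta_{t+1}(s) under a uniform action prior
    # beta_t(s, a) = exp(r(s, a)) * E_{s' ~ p(. | s, a)}[beta_{t+1}(s')]
    beta_sa = np.exp(R) * (P @ beta_s)

# Policy at the first step: p(a_1 | s_1, O_{1:T}) proportional to beta_1(s_1, a_1).
policy = beta_sa / beta_sa.sum(axis=1, keepdims=True)
print(policy)   # a stochastic ("soft") policy: every action keeps some probability mass
```

In log space this recursion replaces the hard max of the classical Bellman backup with a log-sum-exp (a “soft” max), and the resulting policy is stochastic rather than deterministic, which is precisely what yields the entropy-driven exploration mentioned in the last point.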
This blog is based on the following reference: Levine, S. (2018). Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review.