Control As Inference In PGMs (Part 1) - Why is it Interesting?

Introduction and motivation for the control-as-inference paradigm

Motivation For This Blog

Probabilistic Graphical Models (PGMs) are powerful tools for representing relationships between random variables using graphs (directed or undirected). Once we come up with a representation of the stochastic phenomenon we wish to model, PGMs provide a consistent and flexible framework to devise principled objectives, set up models that reflect the causal structure in the world, and allow a common set of inference methods to be deployed against a broad range of problem domains.

This blog summarizes how Reinforcement Learning can be brought under the PGM framework, which lets us recast policy search from an optimization problem into an inference problem. We will also see how this framework allows us to recover "soft" versions of the Bellman backup equations from classical RL.

Premise

Human or animal behaviour is often not perfectly optimal but only approximately optimal. For example, an animal or human whose only goal is to move from the start position to the goal position, as shown in figure 1, can choose either the perfectly optimal, "hard optimal" (green) trajectory or a suboptimal (blue) trajectory, but would typically avoid the red trajectories, which represent bad behaviour.

Figure 1: Stochastic and Suboptimal Human/Agent Behaviour

We need a probabilistic framework for this stochastic phenomenon of goal-directed behaviour. In addition to assigning the highest probability to perfectly optimal behaviour, the framework should assign non-zero probability to suboptimal behaviour, and near-zero probability to bad behaviour that fails to reach the goal.

In machine learning, whenever we observe a stochastic phenomenon, we usually construct a probabilistic graphical model (PGM) of it, such that samples from the model resemble the observed phenomenon.

Thus we need a PGM that models optimal decision-making. Let's first set up a PGM that captures the relationship between states, actions and next states, as in figure 2.

Figure 2: PGM for physically consistent dynamics

The PGM in figure 2, commonly known as a state space model (SSM), can represent physically consistent trajectories, but for control/decision making we need a notion of cost/reward/optimality. We therefore modify the PGM by introducing additional optimality variables $\mathcal{O}_t$, defined by $p(\mathcal{O}_t = 1 \mid s_t, a_t) = \exp(r(s_t, a_t))$.

Figure 3: PGM for optimal decision making

Thus $\mathcal{O}_t$ is a binary random variable indicating whether the behaviour at time $t$ was optimal. Defining its probability as the exponential of the reward conveniently distinguishes between optimal behaviour (the best way to reach the goal), sub-optimal behaviour (reaches the goal, but not optimally) and bad behaviour (fails to reach the goal). This will become more apparent in the next section.
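For a quick numerical sanity check, here is a minimal sketch of how this definition grades individual $(s_t, a_t)$ pairs. Rewards are assumed non-positive so that $\exp(r(s_t, a_t))$ lies in $(0, 1]$ and can be read directly as a probability; the reward values themselves are made up for illustration.

```python
import math

# Hypothetical per-step rewards for three (state, action) pairs.
# Rewards are assumed non-positive so that exp(r) lies in (0, 1]
# and can be read as p(O_t = 1 | s_t, a_t).
rewards = {
    "optimal step":    -0.1,   # moves directly toward the goal
    "suboptimal step": -1.0,   # a detour that still reaches the goal
    "bad step":        -10.0,  # moves away from the goal
}

for name, r in rewards.items():
    p_optimal = math.exp(r)  # p(O_t = 1 | s_t, a_t) = exp(r(s_t, a_t))
    print(f"{name:15s}  r = {r:6.1f}   p(O_t=1 | s_t, a_t) = {p_optimal:.4f}")
```

Optimal behaviour receives a probability close to 1, suboptimal behaviour a smaller but clearly non-zero probability, and bad behaviour a probability close to 0, which is exactly the grading asked for in the Premise section.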

Motivation For Why Control As Inference Is An Interesting Paradigm

1. Convenient Way For Representing and Sampling Optimal and Suboptimal Behaviour

Let’s derive a mathematical expression for the probability of a trajectory (a state-action sequence) given the observed optimality variables. Based on the PGM, we can derive an expression in which both optimal and sub-optimal trajectories are conveniently represented.

\[\begin{aligned}p\left(\tau \mid\mathcal{O}_{1: T}\right) & = \frac{p\left(\tau, \mathcal{O}_{1: T}\right)}{p\left( \mathcal{O}_{1: T}\right)} \\ & \propto p\left(\tau, \mathcal{O}_{1: T}\right)\\ & = p\left(s_{1}\right) \prod_{t=1}^{T} p\left(a_{t} \mid s_{t}\right) p\left(s_{t+1} \mid s_{t}, a_{t}\right) p\left(\mathcal{O}_{t} \mid s_{t}, a_{t}\right) \\ &=p\left(s_{1}\right) \prod_{t=1}^{T} p\left(s_{t+1} \mid s_{t}, a_{t}\right)p\left(a_{t} \mid s_{t}\right) \exp \left(r\left(s_{t}, a_{t}\right) \right) \\ &=\left[\underbrace{p\left(s_{1}\right) \prod_{t=1}^{T} p\left(s_{t+1} \mid s_{t}, a_{t}\right)p\left(a_{t} \mid s_{t}\right)}_\text{physically consistent dynamics with action prior}\right] \underbrace{\exp \left(\sum_{t=1}^{T} r\left(s_{t}, a_{t}\right) \right)}_\text{exponential of sum of rewards}\\ &=\left[\underbrace{p\left(s_{1}\right) \prod_{t=1}^{T} p\left(s_{t+1} \mid s_{t}, a_{t}\right)}_\text{physically consistent dynamics}\right] \underbrace{\exp \left(\sum_{t=1}^{T} r\left(s_{t}, a_{t}\right) + \log p\left(a_{t} \mid s_{t}\right) \right)}_\text{exponential of sum of modified rewards} \end{aligned}\]

As we can see, trajectories that are physically consistent and optimal (in terms of long-term reward) receive the most probability mass. Additionally, a suboptimal trajectory with a slightly lower reward can also be modelled and sampled under this framework, which is important in several settings, including inverse reinforcement learning.

Note: One could ignore the action prior or assume a uniform action prior to further simplify the equations.
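To see the factorisation above in action, the following minimal sketch scores three whole trajectories, corresponding roughly to the green, blue, and red trajectories of figure 1, in a made-up deterministic chain environment. Because the dynamics are deterministic and the action prior is taken to be uniform (as in the note), the unnormalised posterior reduces to $\exp\left(\sum_t r(s_t, a_t)\right)$; all states, rewards, and trajectories here are illustrative assumptions.

```python
import math

# A made-up deterministic 1-D chain with states 0..4 and the goal at state 4.
# Because the dynamics are deterministic, p(s_{t+1} | s_t, a_t) = 1 for every
# physically consistent trajectory, and with a uniform action prior the
# trajectory posterior is simply proportional to exp(sum of rewards).
GOAL = 4
HORIZON = 6  # fixed horizon T, so all trajectories have the same length

def reward(s, a):
    """Illustrative non-positive reward: negative distance of the next state to the goal."""
    return -abs(GOAL - (s + a))

def unnormalised_posterior(actions, s0=0):
    """exp(sum_t r(s_t, a_t)) along a physically consistent trajectory."""
    assert len(actions) == HORIZON
    s, total_r = s0, 0.0
    for a in actions:
        total_r += reward(s, a)
        s += a  # deterministic transition
    return math.exp(total_r)

trajectories = {
    "optimal (green): straight to goal, then stay": [+1, +1, +1, +1, 0, 0],
    "suboptimal (blue): one detour, still reaches": [+1, -1, +1, +1, +1, +1],
    "bad (red): oscillates near the start":         [+1, -1, +1, -1, +1, -1],
}

for name, acts in trajectories.items():
    print(f"{name:46s}  p(tau, O_1:T) proportional to {unnormalised_posterior(acts):.2e}")
```

The optimal trajectory receives by far the most probability mass, the detour remains possible but less likely, and the trajectory that never approaches the goal is effectively ruled out.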

2. Can use established inference schemes to answer several queries, including (a sketch of the first two is given just after this list):

  - the posterior over actions $p(a_t \mid s_t, \mathcal{O}_{t:T})$, i.e. the optimal policy;
  - the backward messages $p(\mathcal{O}_{t:T} \mid s_t, a_t)$, which recover the "soft" value functions and soft Bellman backups mentioned in the introduction;
  - the likelihood of observed trajectories, which is useful for inverse reinforcement learning.

3. Allows us to model stochastic behaviour, which has several advantages, such as improved exploration, greater robustness, and the ability to model suboptimal (human) demonstrations for inverse reinforcement learning.
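To give a flavour of point 2, here is a minimal sketch, assuming a small made-up tabular MDP and a uniform action prior, of computing the backward messages $\beta_t(s_t, a_t) = p(\mathcal{O}_{t:T} \mid s_t, a_t)$ by dynamic programming and reading off the policy $p(a_t \mid s_t, \mathcal{O}_{t:T}) \propto \beta_t(s_t, a_t)$. Taking logarithms of these messages is what yields the "soft" value functions promised in the introduction. The transitions, rewards, and horizon below are illustrative choices, not part of the original text.

```python
import numpy as np

# Hypothetical 3-state, 2-action tabular MDP (numbers chosen only for illustration).
# P[a, s, s_next] = p(s_next | s, a), R[s, a] = r(s, a) (kept non-positive).
n_states, n_actions, T = 3, 2, 5
P = np.zeros((n_actions, n_states, n_states))
P[0] = [[0.9, 0.1, 0.0], [0.0, 0.9, 0.1], [0.0, 0.0, 1.0]]  # action 0 drifts towards state 2
P[1] = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.9, 0.1]]  # action 1 drifts back towards state 0
R = np.array([[-2.0, -3.0], [-1.0, -3.0], [0.0, -3.0]])     # state 2 acts as the "goal"

# Backward messages:
#   beta_T(s, a)   = p(O_T | s, a) = exp(r(s, a))
#   beta_t(s, a)   = exp(r(s, a)) * sum_{s'} p(s' | s, a) * beta_{t+1}(s')
#   beta_{t+1}(s') = sum_{a'} beta_{t+1}(s', a')   (uniform action prior; constant dropped)
beta_sa = np.exp(R)
for _ in range(T - 1):
    beta_s = beta_sa.sum(axis=1)
    beta_sa = np.exp(R) * np.einsum("ask,k->sa", P, beta_s)

# Policy query: p(a_1 | s_1, O_{1:T}) is proportional to beta_1(s_1, a_1).
policy = beta_sa / beta_sa.sum(axis=1, keepdims=True)
print("p(a | s, O_{1:T}):\n", np.round(policy, 3))

# "Soft" quantities: Q(s, a) = log beta(s, a), V(s) = log sum_a exp(Q(s, a)).
Q = np.log(beta_sa)
V = np.log(np.exp(Q).sum(axis=1))
print("soft Q(s, a):\n", np.round(Q, 3))
print("soft V(s):   ", np.round(V, 3))
```

In this toy example the resulting policy puts most of its mass on action 0, which drifts towards the rewarding state, while still assigning non-zero probability to the alternative action, consistent with the stochastic behaviour highlighted in point 3.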

This blog is based on the following reference: Levine, S. (2018). Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review. arXiv:1805.00909.