Deriving the max entropy RL objective and soft Bellman backup equations via variational inference in a PGM
In the third part of this blog series, we discuss another paradigm, in which policy search is reframed as an optimization problem solved via approximate inference.
The inference procedure discussed in part 2 of this blog series solves the following objective:
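Assuming the standard control-as-inference setup, in which the optimality variables satisfy $p(\mathcal{O}_t = 1 \mid s_t, a_t) = \exp\!\big(r(s_t, a_t)\big)$, this objective can be written as the KL divergence between a trajectory distribution $q(\tau)$ and the posterior over trajectories conditioned on optimality:

$$
\min_{q}\; D_{\mathrm{KL}}\big(q(\tau)\,\big\|\,p(\tau \mid \mathcal{O}_{1:T})\big)
$$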
Here the joint distribution of optimal trajectories is given as follows:
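Under the same assumptions, the posterior factorizes as

$$
p(\tau \mid \mathcal{O}_{1:T}) \;\propto\; p(s_1)\prod_{t=1}^{T} p(s_{t+1} \mid s_t, a_t)\,\exp\!\big(r(s_t, a_t)\big).
$$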
Looking at the graphical model for the variational distribution, the joint distribution for $q(\tau)$ can be written as:
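Assuming the variational distribution keeps the true initial state distribution and dynamics and replaces the action conditionals with a policy $\pi(a_t \mid s_t)$, the natural factorization is

$$
q(\tau) \;=\; p(s_1)\prod_{t=1}^{T} p(s_{t+1} \mid s_t, a_t)\,\pi(a_t \mid s_t).
$$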
Here, unlike in the exact inference case, we make an explicit assumption about which parts of the graphical model the agent can control and which it cannot. It is reasonable to assume that the transition dynamics are not controllable by the agent, and hence we fix the initial state distribution and dynamics in $q(\tau)$ to the true ones, $p(s_1)$ and $p(s_{t+1} \mid s_t, a_t)$, leaving only the policy $\pi(a_t \mid s_t)$ free to change.
It can be shown that minimizing this optimization objective yields the max entropy reinforcement learning objective, as derived below:
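A sketch of that derivation, under the factorizations above (the dynamics and initial-state terms cancel because they are shared by $q$ and $p$, and the normalization constant of the posterior does not depend on the policy):

$$
\begin{aligned}
-D_{\mathrm{KL}}\big(q(\tau)\,\|\,p(\tau \mid \mathcal{O}_{1:T})\big)
&= \mathbb{E}_{\tau \sim q}\big[\log p(\tau \mid \mathcal{O}_{1:T}) - \log q(\tau)\big] \\
&= \mathbb{E}_{\tau \sim q}\Big[\sum_{t=1}^{T} \big(r(s_t, a_t) - \log \pi(a_t \mid s_t)\big)\Big] + \text{const} \\
&= \sum_{t=1}^{T} \mathbb{E}_{(s_t, a_t) \sim q}\Big[r(s_t, a_t) + \mathcal{H}\big(\pi(\cdot \mid s_t)\big)\Big] + \text{const}.
\end{aligned}
$$

Minimizing the KL divergence is therefore equivalent to maximizing the expected reward plus the policy entropy at every time step, which is exactly the max entropy RL objective.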
We now look at message passing (backward messages) from an optimization point of view. To calculate the backward messages we start from the last time step.
At the last time step $T$, the only terms in the objective that involve the policy at time $T$ are:
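Written out under the factorizations sketched above (sticking with the minimization form of the objective), the time-$T$ contribution is

$$
\mathbb{E}_{s_T \sim q(s_T)}\Big[\mathbb{E}_{a_T \sim \pi(a_T \mid s_T)}\big[\log \pi(a_T \mid s_T) - r(s_T, a_T)\big]\Big].
$$

If $\exp\!\big(r(s_T, a_T)\big)$ happened to be a normalized distribution over actions, this would already be an expected KL divergence between the policy and that distribution.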
However, note that here we consider the general scenario where the reward can take any real value, so $\exp\!\big(r(s_T, a_T)\big)$ is not necessarily normalized over actions.
Thus we do a little more algebraic manipulation to introduce this normalization constant explicitly, as follows:
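One standard way to do this is to define a soft value function $V(s_T) = \log \int \exp\!\big(r(s_T, a_T)\big)\, da_T$ as the log normalization constant, so the time-$T$ terms can be rewritten as

$$
\mathbb{E}_{s_T \sim q(s_T)}\Big[ D_{\mathrm{KL}}\Big(\pi(a_T \mid s_T)\,\Big\|\, \exp\!\big(r(s_T, a_T) - V(s_T)\big)\Big) - V(s_T) \Big].
$$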
The optimal policy that minimizes this objective is given as:
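Since $V(s_T)$ does not depend on the policy, the KL term is driven to its minimum of zero by matching its target:

$$
\pi(a_T \mid s_T) = \exp\!\big(r(s_T, a_T) - V(s_T)\big).
$$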
At any time step $t$, recursing backwards and substituting the optimal policies at later time steps, the terms that involve $\pi(a_t \mid s_t)$ take an analogous form:
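Following the same pattern, and defining the soft action value $Q(s_t, a_t) = r(s_t, a_t) + \mathbb{E}_{s_{t+1} \sim p(s_{t+1} \mid s_t, a_t)}\big[V(s_{t+1})\big]$ and soft value $V(s_t) = \log \int \exp\!\big(Q(s_t, a_t)\big)\, da_t$, the time-$t$ contribution can be written (up to terms that do not depend on the policy) as

$$
\mathbb{E}_{s_t \sim q(s_t)}\Big[ D_{\mathrm{KL}}\Big(\pi(a_t \mid s_t)\,\Big\|\, \exp\!\big(Q(s_t, a_t) - V(s_t)\big)\Big) - V(s_t) \Big].
$$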
The optimal policy that minimizes this objective at any time step $t$ is given as:
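By the same argument as at time $T$:

$$
\pi(a_t \mid s_t) = \exp\!\big(Q(s_t, a_t) - V(s_t)\big),
\qquad
Q(s_t, a_t) = r(s_t, a_t) + \mathbb{E}_{s_{t+1} \sim p(s_{t+1} \mid s_t, a_t)}\big[V(s_{t+1})\big],
\qquad
V(s_t) = \log \int \exp\!\big(Q(s_t, a_t)\big)\, da_t.
$$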
This means that, if we fix the dynamics and initial state distribution and only allow the policy to change, we recover a Bellman backup operator that uses the expected value of the next state, rather than the optimistic estimate we saw in part 2 of the blog series. Thus the control-as-inference framework lets us avoid the risk-seeking behaviour induced by optimistic Bellman backups.
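To make the difference concrete, here is a minimal tabular sketch (not part of the derivation above; the reward table `R` and transition table `P` are hypothetical) contrasting the soft Bellman backup, which takes an expectation over the next state, with the optimistic log-E-exp backup from exact inference:

```python
import numpy as np

# Minimal tabular sketch (illustrative only): contrast the soft Bellman backup
# derived above, which uses the *expected* next-state value, with the optimistic
# backup from exact inference, which uses log-E-exp over the dynamics.
# R[s, a] and P[s, a, s'] are hypothetical reward and transition tables.

rng = np.random.default_rng(0)
S, A, T = 4, 3, 20                           # number of states, actions, horizon
R = rng.normal(size=(S, A))                  # reward r(s, a)
P = rng.dirichlet(np.ones(S), size=(S, A))   # dynamics p(s' | s, a)

def soft_backup(V):
    """Variational backup: Q(s, a) = r(s, a) + E_{s'}[V(s')]."""
    return R + P @ V

def optimistic_backup(V):
    """Exact-inference backup: Q(s, a) = r(s, a) + log E_{s'}[exp(V(s'))]."""
    return R + np.log(P @ np.exp(V))

V_soft = np.zeros(S)
V_opt = np.zeros(S)
for _ in range(T):
    Q_soft = soft_backup(V_soft)
    Q_opt = optimistic_backup(V_opt)
    # In both cases V(s) = log sum_a exp(Q(s, a)), a soft maximum over actions.
    V_soft = np.log(np.exp(Q_soft).sum(axis=1))
    V_opt = np.log(np.exp(Q_opt).sum(axis=1))

# Optimal policy under the variational objective: pi(a | s) = exp(Q(s, a) - V(s)),
# using the consistent (Q_soft, V_soft) pair from the final backup.
pi = np.exp(Q_soft - V_soft[:, None])

print(pi.sum(axis=1))    # each row sums to 1: a valid policy
print(V_opt >= V_soft)   # optimistic values upper-bound the soft ones (Jensen)
```

Since $\log \mathbb{E}[\exp(V(s_{t+1}))] \ge \mathbb{E}[V(s_{t+1})]$ by Jensen's inequality, the optimistic backup always produces values at least as large as the soft backup, which is one way to see where the risk-seeking behaviour in part 2 comes from.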
We will discuss how this framework is used in practice in modern deep RL algorithms in the next part of this blog series.