Revisit Policy Optimization in Matrix Form
In the tabular case, when the reward and environment dynamics are known, policy evaluation can be written in matrix form as V_π = (I - γP_π)^{-1} r_π, where P_π is the state transition matrix under policy π and r_π is the expected reward under π. The inconvenience is that P_π and r_π are both entangled with π: every time we update π, they change as well. In this paper, we leverage the notation of wang2007dual to disentangle the policy from the environment dynamics, which makes optimization over the policy more straightforward. We show that the policy gradient theorem sutton2018reinforcement and TRPO schulman2015trust can be placed in a more general framework, and that this notation has good potential to extend to model-based reinforcement learning.
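To make the matrix-form evaluation and the disentangling idea concrete, here is a minimal NumPy sketch on a synthetic tabular MDP (the sizes, random seed, and variable names are illustrative assumptions, not taken from the paper). It computes V_π = (I - γP_π)^{-1} r_π once with the policy-mixed quantities P_π and r_π, and once with a policy matrix Π applied to policy-independent dynamics, in the spirit of the dual notation.

```python
import numpy as np

# A small synthetic tabular MDP (all sizes and numbers are illustrative).
S, A, gamma = 4, 2, 0.9
rng = np.random.default_rng(0)

# Environment quantities, which do NOT depend on the policy:
#   P[s, a, s'] = transition probability, r[s, a] = expected reward.
P = rng.random((S, A, S))
P /= P.sum(axis=-1, keepdims=True)
r = rng.random((S, A))

# A stochastic policy: pi[s, a] = probability of taking action a in state s.
pi = rng.random((S, A))
pi /= pi.sum(axis=-1, keepdims=True)

# --- Policy-mixed form: P_pi and r_pi both change whenever pi changes. ---
P_pi = np.einsum("sa,sat->st", pi, P)   # (S, S) state transition matrix under pi
r_pi = np.einsum("sa,sa->s", pi, r)     # (S,)  expected reward under pi

# Matrix-form policy evaluation: V_pi = (I - gamma * P_pi)^{-1} r_pi.
V_pi = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)

# --- Disentangled form: a policy matrix Pi (S x SA) times fixed dynamics. ---
# Pi is block-structured: row s holds pi(.|s) in the s-th action block.
Pi = np.zeros((S, S * A))
for s in range(S):
    Pi[s, s * A:(s + 1) * A] = pi[s]

P_flat = P.reshape(S * A, S)   # dynamics, independent of pi
r_flat = r.reshape(S * A)      # rewards, independent of pi

V_pi_disentangled = np.linalg.solve(np.eye(S) - gamma * Pi @ P_flat, Pi @ r_flat)
assert np.allclose(V_pi, V_pi_disentangled)
```

In this form only Π changes when the policy is updated, while P_flat and r_flat stay fixed, which is what makes optimizing over the policy more straightforward.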