Connection between the Likelihood Ratio Policy Gradient Method and Importance Sampling

Preliminaries

An infinite-horizon discounted Markov Decision Process (MDP) is defined as the tuple $(\mathcal{S},\mathcal{A},P,r,\rho_0,\gamma)$, where

  • $\mathcal{S}$ is a finite set of states, or state space.
  • $\mathcal{A}$ is a finite set of actions, or action space.
  • $P:\mathcal{S}\times\mathcal{A}\times\mathcal{S}\to\mathbb{R}$ is the transition probability distribution, i.e. $P(s,a,s')=P(s'\vert s,a)$ denotes the probability of transitioning to state $s'$ when taking action $a$ from state $s$.
  • $r:\mathcal{S}\times\mathcal{A}\to\mathbb{R}$ is the reward function.
  • $\rho_0:\mathcal{S}\to\mathbb{R}$ is the distribution of the initial state $s_0$.
  • $\gamma\in(0,1)$ is the discount factor.

A (stochastic) policy, denoted $\pi:\mathcal{S}\times\mathcal{A}\to[0,1]$, is a mapping from states to probabilities of selecting each possible action.

Let $\eta(\pi)$ denote the expected cumulative discounted reward when starting at initial state $s_0$ and following $\pi$ thereafter \begin{equation} \eta(\pi)=\mathbb{E}_{s_0,a_0,\ldots}\left[\sum_{t=0}^{\infty}\gamma^t r(s_t,a_t)\right], \end{equation} where \begin{equation} s_0\sim\rho_0(s_0),\hspace{1cm}a_t\sim\pi(a_t\vert s_t),\hspace{1cm}s_{t+1}\sim P(s_{t+1}\vert s_t,a_t) \end{equation} For a policy $\pi$, to measure how good it is to be in a state $s$, or how good it is to take action $a$ in state $s$, we use the state-value function, denoted $V_\pi(s)$, and the action-value function, denoted $Q_\pi(s,a)$. In particular, these value functions are defined as the expected return \begin{align} V_\pi(s_t)&=\mathbb{E}_{a_t,s_{t+1},\ldots}\left[\sum_{k=0}^{\infty}\gamma^k r(s_{t+k},a_{t+k})\right], \\ Q_\pi(s_t,a_t)&=\mathbb{E}_{s_{t+1},a_{t+1},\ldots}\left[\sum_{k=0}^{\infty}\gamma^k r(s_{t+k},a_{t+k})\right], \end{align} where \begin{equation} a_t\sim\pi(a_t\vert s_t),\hspace{1cm}s_{t+1}\sim P(s_{t+1}\vert s_t,a_t),\hspace{1cm}t\geq 0 \end{equation}
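As a quick illustration of these definitions, here is a minimal Python sketch that computes the discounted return $\sum_t\gamma^t r(s_t,a_t)$ of a sampled trajectory and averages returns over roll-outs to form a Monte Carlo estimate of $\eta(\pi)$. The environment interface (`env.reset`, `env.step`) and the `sample_action` function are hypothetical placeholders assumed for the example, not something defined above.

```python
import numpy as np

def discounted_return(rewards, gamma):
    """Compute sum_t gamma^t * r_t for one trajectory."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

def estimate_eta(env, sample_action, gamma, horizon, num_rollouts, rng):
    """Monte Carlo estimate of eta(pi): average discounted return over roll-outs.

    Assumed interfaces: env.reset() -> state,
    env.step(state, action) -> (next_state, reward),
    sample_action(state, rng) -> action drawn from pi(.|state).
    """
    returns = []
    for _ in range(num_rollouts):
        s = env.reset()
        rewards = []
        for _ in range(horizon):
            a = sample_action(s, rng)
            s, r = env.step(s, a)
            rewards.append(r)
        returns.append(discounted_return(rewards, gamma))
    return np.mean(returns)
```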

Likelihood Ratio Policy Gradient

Let $H$ denote the horizon of the MDP.¹ Consider the likelihood ratio policy gradient setting, in which the policy $\pi_\theta$ is parameterized by a vector $\theta\in\mathbb{R}^n$. The expected return of $\pi_\theta$ is then given by \begin{equation} \eta(\pi_\theta)=\mathbb{E}_{P(\tau;\theta)}\left[\left.\sum_{t=0}^{H-1}\gamma^t r(s_t,a_t)\right\vert\pi_\theta\right]=\sum_{\tau}P(\tau;\theta)R(\tau),\label{eq:lrp.1} \end{equation} where

  • $P(\tau;\theta)$ is the probability distribution over trajectories induced by the policy $\pi_\theta$, i.e. \begin{equation} P(\tau;\theta)=\rho_0(s_0)\prod_{t=0}^{H-1}P(s_{t+1}\vert s_t,a_t)\pi_\theta(a_t\vert s_t) \end{equation}
  • $\tau=(s_0,a_0,s_1,a_1,\ldots,s_H,a_H)$ are trajectories generated by rolling out $\pi_\theta$, i.e. $\tau\sim P(\tau;\theta)$.
  • $R(\tau)$ is the discounted cumulative rewards along the trajectory $\tau$, given as \begin{equation} R(\tau)=\sum_{t=0}^{H-1}\gamma^t r(s_t,a_t) \end{equation}

The likelihood ratio policy gradient method performs stochastic gradient ascent (SGA) over the policy parameter space $\Theta$ to find a local optimum of $\eta(\pi_\theta)$, following the policy gradient \begin{align} \nabla_\theta\eta(\pi_\theta)&=\nabla_\theta\sum_{\tau}P(\tau;\theta)R(\tau) \\ &=\sum_\tau\nabla_\theta P(\tau;\theta)R(\tau) \\ &=\sum_\tau\frac{P(\tau;\theta)}{P(\tau;\theta)}\nabla_\theta P(\tau;\theta)R(\tau) \\ &=\sum_{\tau}P(\tau;\theta)\nabla_\theta\log P(\tau;\theta)R(\tau) \\ &=\mathbb{E}_{P(\tau;\theta)}\Big[\nabla_\theta\log P(\tau;\theta)R(\tau)\Big]\label{eq:lrp.2} \end{align} This gradient can be approximated with an empirical estimate from $m$ trajectories $\tau^{(1)},\ldots,\tau^{(m)}$ sampled under policy $\pi_\theta$ \begin{align} \nabla_\theta\eta(\pi_\theta)&=\mathbb{E}_{P(\tau;\theta)}\Big[\nabla_\theta\log P(\tau;\theta)R(\tau)\Big] \\ &\approx\frac{1}{m}\sum_{i=1}^{m}\nabla_\theta\log P(\tau^{(i)};\theta)R(\tau^{(i)})=\hat{g}, \end{align} which is an unbiased estimate of the policy gradient.
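To see the likelihood ratio trick in isolation, the following sketch numerically checks the identity $\nabla_\theta\mathbb{E}_{P(x;\theta)}[f(x)]=\mathbb{E}_{P(x;\theta)}[\nabla_\theta\log P(x;\theta)f(x)]$ for a simple softmax distribution over a finite set rather than a full MDP; the distribution and the function $f$ are made up purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

theta = rng.normal(size=5)   # parameters of a categorical (softmax) distribution
f = rng.normal(size=5)       # arbitrary "reward" assigned to each outcome
p = softmax(theta)

# Exact gradient of E_p[f]: for a softmax, d E[f] / d theta_j = p_j * (f_j - E[f]).
exact_grad = p * (f - p @ f)

# Likelihood ratio (score function) estimate:
# grad_theta log p(x; theta) = one_hot(x) - p for a softmax distribution.
m = 200_000
samples = rng.choice(len(p), size=m, p=p)
scores = np.eye(len(p))[samples] - p                  # shape (m, 5)
lr_grad = (scores * f[samples, None]).mean(axis=0)

print(np.allclose(exact_grad, lr_grad, atol=1e-2))    # True, up to Monte Carlo noise
```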

Additionally, since \begin{align} \nabla_\theta\log P(\tau^{(i)};\theta)&=\nabla_\theta\log\left[\rho_0(s_0^{(i)})\prod_{t=0}^{H-1}P(s_{t+1}^{(i)}\vert s_t^{(i)},a_t^{(i)})\pi_\theta(a_t^{(i)}\vert s_t^{(i)})\right] \\ &=\nabla_\theta\log\rho_0(s_0^{(i)})+\nabla_\theta\sum_{t=0}^{H-1}\log P(s_{t+1}^{(i)}\vert s_t^{(i)},a_t^{(i)})+\nabla_\theta\sum_{t=0}^{H-1}\log\pi_\theta(a_t^{(i)}\vert s_t^{(i)}) \\ &=\sum_{t=0}^{H-1}\nabla_\theta\log\pi_\theta(a_t^{(i)}\vert s_t^{(i)}), \end{align} where the first two terms vanish because neither the initial state distribution nor the dynamics depend on $\theta$, we can rewrite the policy gradient estimate in a form which no longer requires the dynamics model \begin{equation} \hat{g}=\frac{1}{m}\sum_{i=1}^{m}\nabla_\theta\log P(\tau^{(i)};\theta)R(\tau^{(i)})=\frac{1}{m}\sum_{i=1}^{m}\sum_{t=0}^{H-1}\nabla_\theta\log\pi_\theta(a_t^{(i)}\vert s_t^{(i)})R(\tau^{(i)}) \end{equation} Moreover, as \begin{equation} \mathbb{E}_{P(\tau;\theta)}\Big[\nabla_\theta\log P(\tau;\theta)\Big]=\sum_\tau P(\tau;\theta)\nabla_\theta\log P(\tau;\theta)=\sum_\tau\nabla_\theta P(\tau;\theta)=\nabla_\theta\sum_\tau P(\tau;\theta)=\nabla_\theta 1=\mathbf{0}, \end{equation} a baseline $b$ that is constant in $\theta$ (i.e. independent of $\theta$) can be subtracted inside \eqref{eq:lrp.2} without biasing the gradient, which reduces the variance (where $b$ may be optimized to minimize the variance). In particular \begin{align} \nabla_\theta\eta(\pi_\theta)&=\mathbb{E}_{P(\tau;\theta)}\Big[\nabla_\theta\log P(\tau;\theta)(R(\tau)-b)\Big] \\ &\approx\frac{1}{m}\sum_{i=1}^{m}\nabla_\theta\log P(\tau^{(i)};\theta)\left(R(\tau^{(i)})-b\right) \\ &=\frac{1}{m}\sum_{i=1}^{m}\sum_{t=0}^{H-1}\nabla_\theta\log\pi_\theta(a_t^{(i)}\vert s_t^{(i)})\left(R(\tau^{(i)})-b\right)=\hat{g} \end{align} By separating $R(\tau^{(i)})$ into the sum of discounted rewards from the past, which does not depend on the current action $a_t^{(i)}$ (and thus can be dropped to further reduce the variance), and the sum of discounted future rewards, we can continue to decompose the estimator $\hat{g}$ as \begin{align} \hat{g}&=\frac{1}{m}\sum_{i=1}^{m}\sum_{t=0}^{H-1}\nabla_\theta\log\pi_\theta(a_t^{(i)}\vert s_t^{(i)})\left(R(\tau^{(i)})-b\right) \\ &=\frac{1}{m}\sum_{i=1}^{m}\sum_{t=0}^{H-1}\nabla_\theta\log\pi_\theta(a_t^{(i)}\vert s_t^{(i)})\Bigg[\sum_{k=0}^{t-1}\gamma^k r(s_k^{(i)},a_k^{(i)})+\left(\sum_{k=t}^{H-1}\gamma^k r(s_k^{(i)},a_k^{(i)})\right)-b\Bigg] \\ &=\frac{1}{m}\sum_{i=1}^{m}\sum_{t=0}^{H-1}\nabla_\theta\log\pi_\theta(a_t^{(i)}\vert s_t^{(i)})\left(\sum_{k=t}^{H-1}\gamma^k r(s_k^{(i)},a_k^{(i)})-b\right) \end{align} The following are some possible choices of the baseline $b$ (a sketch implementing the estimator with one of these baselines appears after the list).

  • Average return. \begin{equation} b=\mathbb{E}\big[R(\tau)\big]\approx\frac{1}{m}\sum_{i=1}^{m}R(\tau^{(i)}) \end{equation}
  • Optimal baseline (computed component-wise, so each component of $\theta$ gets its own baseline). \begin{equation} b=\frac{\sum_{i=1}^{m}\left(\nabla_\theta\log P(\tau^{(i)};\theta)\right)^2 R(\tau^{(i)})}{\sum_{i=1}^{m}\left(\nabla_\theta\log P(\tau^{(i)};\theta)\right)^2} \end{equation}
  • Time-dependent baseline. \begin{equation} b_t=\frac{1}{m}\sum_{i=1}^{m}\sum_{k=t}^{H-1}\gamma^k r(s_k^{(i)},a_k^{(i)}) \end{equation}
  • State-dependent baseline. \begin{equation} b(s_t^{(i)})=\mathbb{E}_\pi\left[\sum_{k=t}^{H-1}\gamma^{k-t}r(s_k^{(i)},a_k^{(i)})\right]=V_\pi(s_t^{(i)}) \end{equation}
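As a concrete, hedged illustration of the estimator above, here is a minimal numpy sketch of $\hat{g}$ for a tabular softmax policy, using discounted rewards-to-go and the average-return baseline; the trajectory data layout (lists of states, actions, rewards per roll-out) is an assumption made for the example, not something prescribed by the text.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def grad_log_pi(theta, s, a):
    """Gradient of log pi_theta(a|s) for a tabular softmax policy, theta of shape [S, A]."""
    g = np.zeros_like(theta)
    p = softmax(theta[s])
    g[s] = -p
    g[s, a] += 1.0
    return g

def reinforce_gradient(theta, trajectories, gamma, baseline=None):
    """g_hat = 1/m sum_i sum_t grad log pi(a_t|s_t) * (sum_{k>=t} gamma^k r_k - b).

    `trajectories` is a list of (states, actions, rewards) tuples, one per roll-out.
    If baseline is None, the average discounted return over the batch is used as b.
    """
    m = len(trajectories)
    if baseline is None:
        baseline = np.mean([sum(gamma**t * r for t, r in enumerate(rw))
                            for _, _, rw in trajectories])
    g_hat = np.zeros_like(theta)
    for states, actions, rewards in trajectories:
        H = len(rewards)
        # Discounted reward-to-go: rtg[t] = sum_{k=t}^{H-1} gamma^k r_k.
        rtg = np.zeros(H)
        acc = 0.0
        for t in reversed(range(H)):
            acc = gamma**t * rewards[t] + acc
            rtg[t] = acc
        for t in range(H):
            g_hat += grad_log_pi(theta, states[t], actions[t]) * (rtg[t] - baseline)
    return g_hat / m
```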

Importance Sampling

Consider a function $f$ and a probability measure $P$. The expectation of $f$, defined by \begin{equation} \mathbb{E}_{P(X)}\big[f(X)\big]=\int_{x}P(x)f(x)\hspace{0.1cm}dx, \end{equation} can sometimes be difficult to compute analytically. We can instead approximate the expectation by Monte Carlo sampling, as \begin{equation} \mathbb{E}_{P(X)}\big[f(X)\big]\approx\frac{1}{m}\sum_{i=1}^{m}f(x^{(i)}), \end{equation} where $x^{(i)}\sim P$. However, samples from $P$ are not always easy to obtain. This is where importance sampling plays its role.
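For instance, a plain Monte Carlo estimate of $\mathbb{E}_{P(X)}[f(X)]$ for a standard normal $P$ and $f(x)=x^2$ (whose exact value is $1$) can be written as the short sketch below; the choice of $P$ and $f$ is just an illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: x**2

# E_P[f(X)] with P = N(0, 1); the exact value is Var(X) = 1.
x = rng.normal(loc=0.0, scale=1.0, size=100_000)
print(np.mean(f(x)))   # approximately 1.0
```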

The idea of importance sampling (IS) is the observation that \begin{align} \mathbb{E}_{P(X)}\big[f(X)\big]&=\int_{x}P(x)f(x)\hspace{0.1cm}dx \\ &=\int_{x}Q(x)\frac{P(x)}{Q(x)}f(x)\hspace{0.1cm}dx \\ &=\mathbb{E}_{Q(X)}\left[\frac{P(X)}{Q(X)}f(X)\right], \end{align} where we have assumed that $Q(x)=0\Rightarrow P(x)=0$. Hence, we can use a sample-based method on $Q$ to get an estimate of the expectation. Specifically, given $x^{(i)}\sim Q$, for $i=1,\ldots,m$, we can construct an unbiased estimator \begin{equation} \mathbb{E}_{P(X)}\big[f(X)\big]=\mathbb{E}_{Q(X)}\left[\frac{P(X)}{Q(X)}f(X)\right]\approx\frac{1}{m}\sum_{i=1}^{m}\frac{P(x^{(i)})}{Q(x^{(i)})}f(x^{(i)}) \end{equation}
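Continuing the toy example above, the sketch below estimates the same expectation $\mathbb{E}_{P(X)}[x^2]$ with $P=\mathcal{N}(0,1)$, but now draws samples from a different proposal $Q=\mathcal{N}(1,2)$ and reweights them by $P(x)/Q(x)$; the particular choice of $Q$ is arbitrary, as long as it covers the support of $P$.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
f = lambda x: x**2

p = norm(loc=0.0, scale=1.0)   # target distribution P
q = norm(loc=1.0, scale=2.0)   # proposal distribution Q (covers the support of P)

# Sample from Q, reweight by the importance ratio P(x) / Q(x).
x = q.rvs(size=100_000, random_state=rng)
weights = p.pdf(x) / q.pdf(x)
print(np.mean(weights * f(x)))   # approximately 1.0, same target as before
```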

Likelihood Ratio Policy Gradient via IS

The IS method suggests rewriting the expected return of policy $\pi_\theta$ as² \begin{align} \eta(\pi_\theta)&=\mathbb{E}_{\tau\sim P(\tau;\theta)}\big[R(\tau)\big] \\ &=\mathbb{E}_{\tau\sim P(\tau;\theta')}\left[\frac{P(\tau;\theta)}{P(\tau;\theta')}R(\tau)\right] \end{align} Taking the gradient w.r.t. $\theta$ gives us another representation of the policy gradient \begin{align} \nabla_\theta\eta(\pi_\theta)&=\nabla_\theta\mathbb{E}_{\tau\sim P(\tau;\theta')}\left[\frac{P(\tau;\theta)}{P(\tau;\theta')}R(\tau)\right] \\ &=\mathbb{E}_{\tau\sim P(\tau;\theta')}\left[\frac{\nabla_\theta P(\tau;\theta)}{P(\tau;\theta')}R(\tau)\right], \end{align} which implies that \begin{align} \nabla_\theta\eta(\pi_\theta)\big\vert_{\theta=\theta'}&=\mathbb{E}_{\tau\sim P(\tau;\theta')}\left[\frac{\nabla_\theta P(\tau;\theta)\big\vert_{\theta=\theta'}}{P(\tau;\theta')}R(\tau)\right] \\ &=\mathbb{E}_{\tau\sim P(\tau;\theta')}\big[\nabla_\theta\log P(\tau;\theta)\big\vert_{\theta=\theta'}R(\tau)\big] \end{align} That is, evaluating the gradient of the IS estimator at $\theta=\theta'$ recovers exactly the likelihood ratio policy gradient \eqref{eq:lrp.2} for the current policy.
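Footnote 2 below spells out that the per-trajectory importance weight reduces to a product of per-step policy ratios. The following sketch computes that off-policy estimate of $\eta(\pi_\theta)$ from trajectories collected under $\pi_{\theta'}$, for the same tabular softmax policy used in the earlier sketch; the trajectory layout is again an assumption of the example.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def pi(theta, s, a):
    """Tabular softmax policy: probability of action a in state s."""
    return softmax(theta[s])[a]

def is_return_estimate(theta, theta_behavior, trajectories, gamma):
    """Estimate eta(pi_theta) from roll-outs of pi_{theta'} via importance sampling.

    Importance weight per trajectory: prod_t pi_theta(a_t|s_t) / pi_{theta'}(a_t|s_t);
    the dynamics terms cancel, so no transition model is needed.
    """
    estimates = []
    for states, actions, rewards in trajectories:
        ratio = 1.0
        for s, a in zip(states, actions):
            ratio *= pi(theta, s, a) / pi(theta_behavior, s, a)
        ret = sum(gamma**t * r for t, r in enumerate(rewards))
        estimates.append(ratio * ret)
    return np.mean(estimates)
```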

References

[1] Jie Tang & Pieter Abbeel. On a Connection between Importance Sampling and the Likelihood Ratio Policy Gradient. NIPS 2010.

[2] Richard S. Sutton & Andrew G. Barto. Reinforcement Learning: An Introduction. MIT press, 2018.

Footnotes


  1. Any infinite-horizon discounted MDP, as defined in the preceding subsection, can be $\epsilon$-approximated by a finite-horizon MDP, using a horizon \begin{equation*} H_\epsilon=\left\lceil\log_\gamma\left(\frac{\epsilon(1-\gamma)}{R_\text{max}}\right)\right\rceil, \end{equation*} where \begin{equation*} R_\text{max}=\max_{s,a}\big\vert r(s,a)\big\vert \end{equation*} ↩︎

  2. We can also use importance sampling to construct an unbiased estimate of the expected return $\eta(\pi_\theta)$. In particular, the expected return of $\pi_\theta$ given in \eqref{eq:lrp.1} can be approximated by \begin{equation*} \eta(\pi_\theta)=\sum_{\tau}P(\tau;\theta)R(\tau)\approx\frac{1}{m}\sum_{i=1}^{m}\frac{P(\tau^{(i)};\theta)}{Q(\tau^{(i)})}R(\tau^{(i)}), \end{equation*} where $\tau^{(i)}\sim Q$ and we have assumed that $Q(\tau^{(i)})=0\Rightarrow P(\tau^{(i)};\theta)=0$.
    If we choose $Q(\tau)\doteq P(\tau;\theta')$, this means we are approximating the expected return of $\pi_\theta$ from trajectories generated according to another parameterized policy $\pi_{\theta'}$ \begin{equation*} \eta(\pi_\theta)\approx\frac{1}{m}\sum_{i=1}^{m}\frac{P(\tau^{(i)};\theta)}{P(\tau^{(i)};\theta')}R(\tau^{(i)}) \end{equation*} Taking the gradient of this estimator w.r.t. $\theta$, we obtain an unbiased estimator for the policy gradient specified in \eqref{eq:lrp.2} \begin{align*} \nabla_\theta\eta(\pi_\theta)&\approx\nabla_\theta\frac{1}{m}\sum_{i=1}^{m}\frac{P(\tau^{(i)};\theta)}{P(\tau^{(i)};\theta')}R(\tau^{(i)}) \end{align*} As when applying IS to off-policy learning, evaluating the importance weights (importance sampling ratios) does not require a dynamics model \begin{equation*} \frac{P(\tau^{(i)};\theta)}{P(\tau^{(i)};\theta')}=\frac{\rho_0(s_0^{(i)})\prod_{t=0}^{H-1}P(s_{t+1}^{(i)}\vert s_t^{(i)},a_t^{(i)})\pi_\theta(a_t^{(i)}\vert s_t^{(i)})}{\rho_0(s_0^{(i)})\prod_{t=0}^{H-1}P(s_{t+1}^{(i)}\vert s_t^{(i)},a_t^{(i)})\pi_{\theta'}(a_t^{(i)}\vert s_t^{(i)})}=\prod_{t=0}^{H-1}\frac{\pi_\theta(a_t^{(i)}\vert s_t^{(i)})}{\pi_{\theta'}(a_t^{(i)}\vert s_t^{(i)})} \end{equation*} ↩︎