Policy Gradients
In this article, we will learn about Policy Gradients and implement the method in PyTorch.
Introduction
Policy Gradients are a family of model-free reinforcement learning algorithms.
In the previous blog posts, we saw Q-learning based algorithms like DQN and DRQN, where, given a state, we approximate the Q-values of the possible actions. The Q-value of an action is the expected return for the rest of the episode if that action is selected from that state, and an epsilon-greedy strategy is used to pick actions. In the Policy Gradients method, instead of approximating the action-value function, we directly learn the optimal policy.
Advantages of Policy Gradient Methods
1. Better convergence properties.
2. Continuous action spaces - Q-learning based methods cannot be used for environments with a continuous action space, whereas policy gradient methods can.
3. Policy Gradients can learn stochastic policies. As we will see in the Implementation Details section, we choose actions by sampling from the policy's output distribution, so we do not need an epsilon-greedy strategy like the one used in Q-learning; the policy itself is stochastic.
Policy Gradients
In the Policy Gradients method, we try to learn the optimal policy directly. Consider a function y = f(x): for some input x, the output is f(x), i.e., y. Similarly, the policy is a function that takes a state as input and outputs the probabilities of the different actions. Our goal is to approximate this function so that we obtain the maximum possible reward.
When we use a neural network, we can say that the weights of the neural network are the parameters of the policy. Therefore we need to update these parameters in such a way that the expected reward is maximized.
In a typical classification problem, we have a loss function and our objective is to minimize the loss. Here we have a reward function, and our objective is to maximize the expected reward.
Here, the policy π_θ is parametrized by θ, J(θ) = E_{τ∼π_θ}[R(τ)] is the expected return over trajectories τ, and R is the reward. Now we need to find the gradient of this objective with respect to the policy parameters θ so that we can perform the parameter update θ ← θ + α ∇_θ J(θ), where α is the learning rate.
However, directly calculating this gradient is tricky because the expected return depends on both the policy and the state distribution, and since the environment dynamics are unknown we do not know how a policy update changes the state distribution. Therefore, we use the policy gradient theorem, which provides a reformulation in which the gradient does not involve the derivative of the state distribution.
∇_θ J(θ) = E_{π_θ}[ Q^{π_θ}(s, a) ∇_θ log π_θ(a | s) ]
For the proof of this theorem, you can refer to the book by Sutton & Barto (2017), Sec. 13.1.
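In practice, the expectation is estimated from sampled trajectories. A standard sample-based form of this gradient (the REINFORCE estimator; the implementation below applies it with a single trajectory per update) is:

```latex
% Monte-Carlo (REINFORCE) estimate of the policy gradient from N sampled trajectories
\nabla_\theta J(\theta) \;\approx\; \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{T_i - 1}
    G_t^{(i)} \, \nabla_\theta \log \pi_\theta\!\left(a_t^{(i)} \mid s_t^{(i)}\right),
\qquad
G_t^{(i)} = \sum_{k=t}^{T_i - 1} \gamma^{\,k - t} \, r_k^{(i)}
```

Here G_t is the discounted return from timestep t onward and γ is the discount factor.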
Policy Gradient Algorithm
Pseudocode:
1. Initialize the policy parameters
2. for each episode:
3.     Generate a trajectory using the current policy
4.     for each timestep t in the trajectory:
5.         Estimate the return G_t
6.     Evaluate the policy and update the parameters
So we generate a trajectory using the current policy, i.e., by taking actions according to that policy. The trajectory can be one episode of an episodic task or a fixed number of time steps of a non-episodic task. Then we evaluate the current policy using the collected rewards, calculate the gradient, and update the parameters.
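For reference, the "estimate the return" step is typically a discounted sum of the rewards collected after each timestep. Below is a minimal sketch of such a helper; the discount factor of 0.99 and the normalization step are choices made here, not necessarily those of the original code:

```python
import torch

def compute_returns(rewards, gamma=0.99):
    """Compute the discounted return G_t for every timestep of one trajectory."""
    returns = []
    g = 0.0
    # Walk backwards so that G_t = r_t + gamma * G_{t+1}
    for r in reversed(rewards):
        g = r + gamma * g
        returns.insert(0, g)
    returns = torch.tensor(returns)
    # Normalizing the returns is a common variance-reduction trick
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)
    return returns
```

This helper is reused in the update step sketched later in this section.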
Implementation Details
In this section, we will go through the code to implement this algorithm in PyTorch. We will use OpenAI Gym's CartPole environment.
1. Importing the necessary libraries
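The exact code is linked at the end of the post; a minimal set of imports that the sketches in this section assume is:

```python
import gym
import torch
import torch.nn as nn
import torch.optim as optim
from torch.distributions import Categorical
```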
2. Defining the model
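CartPole observations have 4 components and there are 2 discrete actions, so a small fully connected network with a softmax output is enough. The hidden size of 128 below is an illustrative choice, not necessarily the architecture used in the original code:

```python
class PolicyNetwork(nn.Module):
    """Maps a state to a probability distribution over the actions."""

    def __init__(self, state_dim=4, n_actions=2, hidden_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, n_actions),
            nn.Softmax(dim=-1),  # turn logits into action probabilities
        )

    def forward(self, state):
        return self.net(state)
```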
3. Setting up the environment and declaring the optimizer
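A possible setup (CartPole-v1, Adam, and the learning rate of 1e-2 are assumed choices):

```python
env = gym.make("CartPole-v1")
policy = PolicyNetwork()
optimizer = optim.Adam(policy.parameters(), lr=1e-2)
```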
4. Policy Gradient algorithm
We select actions by sampling from the probability distribution output by the policy: the network outputs the probability of taking each action, and we sample an action from that distribution. For this, we can use the Categorical distribution from PyTorch, which samples an index according to the probability at each position; e.g., if the input is [0.5, 0.3, 0.2], it samples 0 with probability 0.5, 1 with probability 0.3, and 2 with probability 0.2. A sketch of this step follows below.
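In the sketch below, select_action samples from a Categorical distribution built from the policy's output and records the log-probability needed for the gradient, while update_policy performs the gradient step by minimizing the negative of log-probability times return (equivalently, maximizing the expected return). The function names are illustrative, and update_policy reuses the compute_returns helper sketched earlier:

```python
def select_action(state):
    """Sample an action from the current policy and keep its log-probability."""
    state = torch.as_tensor(state, dtype=torch.float32)
    probs = policy(state)          # action probabilities, e.g. tensor([0.6, 0.4])
    dist = Categorical(probs)      # samples index i with probability probs[i]
    action = dist.sample()
    return action.item(), dist.log_prob(action)


def update_policy(log_probs, rewards, gamma=0.99):
    """One policy-gradient (REINFORCE) update from a single trajectory."""
    returns = compute_returns(rewards, gamma)
    # Gradient ascent on the expected return == gradient descent on its negative
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```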
5. Run the code
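Putting the pieces together, the training loop below runs one update per episode. It assumes the classic Gym step/reset API (older gym versions); with gymnasium or gym >= 0.26, reset and step return extra values. The episode count and logging interval are arbitrary choices:

```python
num_episodes = 1000
for episode in range(num_episodes):
    state = env.reset()
    log_probs, rewards = [], []
    done = False
    while not done:
        action, log_prob = select_action(state)
        state, reward, done, _ = env.step(action)
        log_probs.append(log_prob)
        rewards.append(reward)
    update_policy(log_probs, rewards)
    if (episode + 1) % 50 == 0:
        print(f"Episode {episode + 1}: total reward = {sum(rewards)}")
```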
6. Output
The code is available on GitHub.
Results
In the example above, the policy converges quite fast: within just 200 episodes we observe an average reward of 300+, although it fluctuates up and down thereafter. However, if we run the same code many times, we will observe that sometimes the average reward never rises beyond 10. The reason behind this is that Policy Gradients algorithms are prone to getting stuck in a local optimum and then failing to improve, whereas Q-learning based algorithms are less prone to getting stuck in local optima. This is one of the disadvantages of Policy Gradients.