
3 Important Considerations in DDPG Reinforcement Algorithm | by Manjeet Singh Nagi | Jun, 2024

Photo by Jeremy Bishop on Unsplash

Deep Deterministic Policy Gradient (DDPG) is a reinforcement learning algorithm for learning continuous actions.

Here are 3 important considerations you will have to work on while solving a problem with DDPG. Please note that this is not a How-to guide on DDPG but a what-to guide in the sense that it only talks about what areas you will have to look into.


The original DDPG paper uses noise for exploration, and it generates that noise so that the value at one step depends on the value at the previous step. This temporally correlated noise is implemented with the Ornstein-Uhlenbeck process. Later implementations often dropped this constraint and used independent random noise at every step. Depending on your problem domain, you may not want the noise at one step to depend on the noise at the earlier step. If it does, the noise tends to stay on one side of its mean for some time, which may limit exploration. For the problem I am trying to solve with DDPG, simple random noise works just fine.
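To make the difference concrete, here is a minimal sketch of both kinds of noise in plain Python. The class names and the parameters `theta`, `sigma`, and `dt` are illustrative assumptions, not taken from any particular library, and the default values are only commonly seen starting points.

```python
import random


class OUNoise:
    """Temporally correlated noise: each sample depends on the previous one."""

    def __init__(self, mu=0.0, theta=0.15, sigma=0.2, dt=1.0, seed=None):
        # mu: long-run mean, theta: pull strength toward mu,
        # sigma: size of the random kick (all assumed defaults).
        self.mu, self.theta, self.sigma, self.dt = mu, theta, sigma, dt
        self.rng = random.Random(seed)
        self.x = mu  # current noise value

    def reset(self):
        # Typically called at the start of every episode.
        self.x = self.mu

    def sample(self):
        # Discretized Ornstein-Uhlenbeck step: drift back toward mu,
        # then add a Gaussian kick.
        drift = self.theta * (self.mu - self.x) * self.dt
        diffusion = self.sigma * (self.dt ** 0.5) * self.rng.gauss(0.0, 1.0)
        self.x += drift + diffusion
        return self.x


class GaussianNoise:
    """Uncorrelated noise: each sample is drawn independently."""

    def __init__(self, mu=0.0, sigma=0.2, seed=None):
        self.mu, self.sigma = mu, sigma
        self.rng = random.Random(seed)

    def sample(self):
        return self.rng.gauss(self.mu, self.sigma)
```

Because `OUNoise` carries its state in `self.x`, consecutive samples tend to stay on the same side of `mu` for a while, whereas `GaussianNoise` can switch sign at every step.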

Size of noise

The size of the noise you use for exploration is also important. If the valid actions in your problem domain range from -0.01 to 0.01, there is little benefit in using noise with a mean of 0 and a standard deviation of 0.2: most noisy actions would land far outside the valid range, so the algorithm would spend its time exploring invalid areas.
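One way to keep the noise in proportion is to tie its standard deviation to the width of the action range and clip the result. The sketch below assumes a single scalar action; the function name and the `noise_fraction` parameter are hypothetical and would need tuning for a real problem.

```python
import random


def noisy_action(action, low, high, noise_fraction=0.1, rng=random):
    """Add Gaussian noise scaled to the action range, then clip to [low, high]."""
    # Noise std expressed as a fraction of the valid range (an assumed heuristic).
    sigma = noise_fraction * (high - low)
    noisy = action + rng.gauss(0.0, sigma)
    # Clip so the environment never receives an invalid action.
    return max(low, min(high, noisy))
```

For an action range of -0.01 to 0.01, `noise_fraction=0.1` gives a standard deviation of 0.002 instead of 0.2, so most of the exploration stays inside the valid range.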

Noise decay

Many blogs recommend decaying the noise slowly during training, while many others keep the noise un-decayed throughout training. I think a well-trained algorithm will work fine with either option. If you do not decay the noise, you can simply drop it during prediction, and a well-trained network will be fine with that.
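Both options can be sketched in a few lines. The exponential schedule, the function names, and all the default values below are assumptions for illustration; other decay shapes (for example linear) are just as plausible.

```python
def decayed_sigma(step, sigma_start=0.2, sigma_end=0.01, decay=0.999):
    """Exponentially decay the noise std from sigma_start down to a floor."""
    return max(sigma_end, sigma_start * decay ** step)


def exploration_sigma(step, training, use_decay=True, sigma_start=0.2):
    """Noise std to use at a given step; zero at prediction time."""
    if not training:
        return 0.0  # drop the noise entirely when predicting
    if use_decay:
        return decayed_sigma(step, sigma_start=sigma_start)
    return sigma_start  # constant, un-decayed noise during training
```

With these assumed defaults, the noise starts at 0.2 and settles at the floor of 0.01 after roughly 3,000 steps.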

As you update your policy networks, you have to pass a fraction of the learning on to the target networks at a certain frequency. So there are two aspects to look at here: how often you pass the learning to the target networks (the original paper does it after every update of the policy network), and what fraction of the learning you pass on each time. A hard update, which copies the policy network into the target network in full, is generally not recommended because it can destabilize training.
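The two knobs, frequency and fraction, can be sketched as follows. Weights are represented here as flat lists of floats to keep the example dependency-free; in a real implementation they would be framework tensors. The names `tau` and `every` are assumed for illustration.

```python
def soft_update(target_params, policy_params, tau=0.01):
    """Blend policy weights into target weights.

    new_target = tau * policy + (1 - tau) * target
    """
    return [tau * p + (1.0 - tau) * t
            for t, p in zip(target_params, policy_params)]


def maybe_update_target(step, target_params, policy_params, tau=0.01, every=1):
    """Apply the soft update only once every `every` policy updates."""
    if step % every == 0:
        return soft_update(target_params, policy_params, tau)
    return target_params
```

Setting `tau=1.0` turns this into a hard update, since the target then becomes an exact copy of the policy weights, and `every=1` matches updating the target after every policy update.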

But a hard update to the target network worked fine for me. Here is my thought process. Say your learning rate for the policy network is 0.001 and you pass 0.01 of each policy update on to the target network. In a way, you are passing 0.001 * 0.01 = 0.00001 of the learning to the target network per update. If your neural network is stable with this, it should also be stable with a hard update (passing all the learning from the policy network to the target network every time you update the policy network), provided you keep the learning rate very low.

While you are optimizing your DDPG algorithm parameters, you also need to design a good neural network for predicting actions and values. This is where the challenge lies: it is difficult to tell whether poor performance is due to a badly designed neural network or to unoptimized DDPG parameters. You will need to keep optimizing on both fronts.

While a simple neural network can help you solve OpenAI Gym problems, it will not be sufficient for a complex real-world problem. The principle I follow while designing a neural network is that the network is an implementation of your (or the domain expert’s) mental framework of the solution. So you need to understand the domain expert's mental framework at a very fundamental level to implement it in a neural network. You also need to understand which features to pass to the network and how to engineer them so that the network can interpret them and predict successfully. And that is where the art of the craft lies.

I still have not explored discount rate (which is used to discount rewards over time-steps) and have not yet developed a strong intuition (which is very important) about it.

I hope you liked the article and did not find it overly simplistic or stupid. If you liked it, please do not forget to clap!
