Let's say we have a robot that is able to adapt to changing environments, similar to this paper [1]. When there is a change in the environment (e.g. the light dimming), the robot's performance drops and it needs to explore the new environment by collecting data and re-running the Q-learning algorithm to update its policy so that it can "adapt". Collecting the new data and updating the policy takes about 4-5 hours. I was wondering: if I have an army of these robots in the same room, undergoing the same environmental changes, can the data collection be sped up so that the policy can be updated more quickly, say in under an hour, allowing the robots' performance to increase again?
I believe you are talking about scaling learning horizontally, i.e. training multiple agents in parallel.
A3C is one algorithm that does this: it trains multiple agents in parallel and independently of each other. Each agent has its own environment, which lets it gain different experience from the rest of the agents, ultimately increasing the breadth of the agents' collective experience. Each agent asynchronously updates a shared network, and you use this network to drive your main agent.
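To illustrate that pattern, here is a minimal toy sketch, not a real A3C implementation: several worker threads, each with its own environment instance, asynchronously apply updates to a shared parameter vector. The `ToyEnv` class, the shared parameter vector, and the update rule are all placeholders I'm assuming for illustration, not anything from your setup or from the A3C paper.

```python
import threading
import numpy as np

N_WORKERS = 4
STEPS_PER_WORKER = 1000

shared_params = np.zeros(8)     # stand-in for the shared network's weights
lock = threading.Lock()         # guards the asynchronous updates

class ToyEnv:
    """Placeholder environment; each worker gets its own independent copy."""
    def __init__(self, seed):
        self.rng = np.random.default_rng(seed)

    def step(self):
        # Fake "experience": a feature vector and a reward signal.
        state = self.rng.normal(size=8)
        reward = float(state @ np.ones(8) + self.rng.normal())
        return state, reward

def worker(worker_id):
    env = ToyEnv(seed=worker_id)            # independent environment per worker
    for _ in range(STEPS_PER_WORKER):
        state, reward = env.step()
        update = 0.001 * reward * state     # local, gradient-like update
        with lock:                          # asynchronously push it to the shared params
            shared_params += update

threads = [threading.Thread(target=worker, args=(i,)) for i in range(N_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print("shared parameters after parallel training:", shared_params)
```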
You mentioned that you wanted to use the same environment for all parallel agents. I can think of this in two ways:
If you are talking about a single environment physically shared among agents, this could possibly speed things up, but you are unlikely to gain much in terms of performance. You are also very likely to run into issues with episode completion: if multiple agents are taking steps in the same environment simultaneously, the transitions will be a mess, to say the least. The complexity cost is high and the benefit is negligible.
If you are talking about cloning the same environment for each agent, then you gain both speed and broader experience, which translates into performance. This is probably the sane thing to do.
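To make the cloned-environment idea concrete, here is a hedged sketch of the data-collection side: several robot threads, each with its own copy of the same (placeholder) environment, push transitions into one shared replay buffer so that a single Q-learning update pass can start much sooner. The toy dynamics, the reward, and names like `replay_buffer` are assumptions for illustration only, not your robots' actual interface.

```python
import queue
import random
import threading

N_ROBOTS = 8
TRANSITIONS_PER_ROBOT = 500

replay_buffer = queue.Queue()    # shared, thread-safe buffer of transitions

def collect(robot_id):
    rng = random.Random(robot_id)            # each clone explores differently
    state = 0
    for _ in range(TRANSITIONS_PER_ROBOT):
        action = rng.choice([0, 1])
        next_state = state + (1 if action else -1)
        reward = -abs(next_state)            # placeholder reward signal
        replay_buffer.put((state, action, reward, next_state))
        state = next_state

robots = [threading.Thread(target=collect, args=(i,)) for i in range(N_ROBOTS)]
for r in robots:
    r.start()
for r in robots:
    r.join()

# All robots' experience is now available for a single Q-learning update pass.
print("collected transitions:", replay_buffer.qsize())
```

The key point for your use case is that with N robots collecting in parallel, the wall-clock time to gather a given amount of experience drops roughly by a factor of N, which is the main lever for getting your 4-5 hour adaptation time down toward the one-hour mark.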