machine-learning, neural-network, reinforcement-learning

Reinforcement learning toy project


My toy project to learn & apply Reinforcement Learning is:


I need some hints and names of RL algorithms that suit this case:
- I think it is a POMDP, but can I model it as an MDP and just ignore the noise?
- If it is a POMDP, what is the recommended way of evaluating the probabilities?
- Which is better to use in this case: value functions or policy iteration?
- Can I use a NN to model the environment dynamics instead of using explicit equations?
- If yes, is there a specific type/model of NN you would recommend?
- I think actions must be discretized, right?

Solution

  • If this is your first experiment with reinforcement learning, I would recommend starting with something much simpler than this. You can start simple to get the hang of things and then move on to a more complicated project like this one. POMDPs still give me trouble, and I have been working in RL for quite a while now. I'll try to answer what questions I can.

    I think it is a POMDP, but can I model it as an MDP and just ignore the noise?

    Yes. POMDP stands for Partially Observable Markov Decision Process. The partially observable part refers to the fact that the agent can't know its state perfectly, but can estimate it based on observations. In your case, you would have the location of the rocket as an observation that may contain noise, and based on the agent's previous knowledge you would update its belief of where the missiles are. That adds a lot of complexity. It would be much easier to treat the missile locations as exact and not have to deal with uncertainty; then you would not have to use a POMDP.

    If it is a POMDP, what is the recommended way of evaluating the probabilities?

    I don't fully understand your question, but you would use some form of Bayes' rule. That is, you would have a distribution that is your belief state (the probabilities of being in any given state); that is your prior distribution, and based on observations you would adjust it and get a posterior distribution. Look into Bayes' rule if you need more information.
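    To make that prior-to-posterior step concrete, here is a minimal sketch of a discrete Bayesian belief update. The grid of positions, the Gaussian noise model, and all names are assumptions made up for illustration, not taken from your project.

    ```python
    import numpy as np

    # Hypothetical example: 10 possible missile positions along a line.
    n_states = 10
    belief = np.full(n_states, 1.0 / n_states)        # uniform prior over positions

    def observation_likelihood(observed_pos, true_pos, noise_std=1.0):
        """P(observation | state), assuming Gaussian sensor noise (an assumption)."""
        return np.exp(-0.5 * ((observed_pos - true_pos) / noise_std) ** 2)

    def update_belief(belief, observed_pos):
        """One Bayes'-rule step: posterior is proportional to likelihood * prior."""
        likelihood = np.array([observation_likelihood(observed_pos, s) for s in range(n_states)])
        posterior = likelihood * belief
        return posterior / posterior.sum()            # normalize so it stays a distribution

    belief = update_belief(belief, observed_pos=4.3)
    print(belief)                                     # probability mass concentrates near position 4
    ```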

    Which is better to use in this case: value functions or policy iteration?

    Most of my experience has been with value functions, and I find them relatively easy to use and understand, but I don't know what else to tell you. I think this is really your choice; I would have to spend time working on the project to make a better recommendation. To give you a feel for the value-function side, there is a small sketch below.
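    Here is a minimal value-iteration sketch on a made-up chain MDP (the states, actions, rewards, and discount factor are all invented for illustration; this is not your missile problem). It shows the core idea of a value function: repeatedly back up V(s) = max_a [r + gamma * V(s')], then read the greedy policy off the values.

    ```python
    import numpy as np

    # Made-up chain MDP: 5 states in a line, actions move left/right, reward 1 for reaching the last state.
    n_states, gamma = 5, 0.9
    actions = [-1, +1]

    def step(s, a):
        """Deterministic toy dynamics: move and clip to the ends of the chain."""
        s_next = min(max(s + a, 0), n_states - 1)
        reward = 1.0 if s_next == n_states - 1 else 0.0
        return s_next, reward

    V = np.zeros(n_states)
    for _ in range(100):
        # Bellman optimality backup: V(s) = max_a [ r(s,a) + gamma * V(s') ]
        V = np.array([max(step(s, a)[1] + gamma * V[step(s, a)[0]] for a in actions)
                      for s in range(n_states)])

    # Greedy policy extracted from the value function.
    policy = [max(actions, key=lambda a: step(s, a)[1] + gamma * V[step(s, a)[0]])
              for s in range(n_states)]
    print(V, policy)   # values grow toward the goal; the policy keeps moving right
    ```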

    Can I use a NN to model the environment dynamics instead of using explicit equations? If yes, is there a specific type/model of NN you would recommend?

    I don't know anything about using NNs to model environment dynamics, sorry.

    I think actions must be discretized, right?

    Yes. You would need a discrete list of actions and a discrete list of states. Generally the algorithm chooses the best action for any given state, and for the simplest algorithms (something like Q-learning) you just keep track of a value for every state-action pair; there is a minimal sketch of this below.
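    As a concrete illustration of that state-action table, here is a minimal tabular Q-learning sketch. The toy 1-D grid environment, the hyperparameters, and all names are stand-ins chosen for illustration, not part of your missile problem.

    ```python
    import random
    from collections import defaultdict

    # Toy stand-in environment: a 1-D grid where the agent walks toward a goal cell.
    n_states, goal = 6, 5
    actions = [-1, +1]                        # discretized action set
    alpha, gamma, epsilon = 0.1, 0.95, 0.1    # learning rate, discount, exploration rate

    Q = defaultdict(float)                    # Q[(state, action)] -> estimated value

    def env_step(s, a):
        s_next = min(max(s + a, 0), n_states - 1)
        reward = 1.0 if s_next == goal else 0.0
        return s_next, reward, s_next == goal

    for episode in range(500):
        s, done = 0, False
        while not done:
            # Epsilon-greedy selection over the discrete action list.
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda a_: Q[(s, a_)])
            s_next, r, done = env_step(s, a)
            # Q-learning update: move Q(s,a) toward r + gamma * max_a' Q(s',a').
            best_next = max(Q[(s_next, a_)] for a_ in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next

    # Greedy policy read off the learned table: every state should point toward the goal.
    print({s: max(actions, key=lambda a_: Q[(s, a_)]) for s in range(n_states)})
    ```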

    If you are just learning all of this, I would recommend the Sutton and Barto text. Also, if you want to see a simple example of an RL algorithm, I have a very simple base class and an example using it up on GitHub (written in Python). The abstract_rl class is meant to be extended for RL tasks, but is very simple. simple_rl.py is an example of a simple task using base_rl (a simple grid with one position being the goal, using Q-learning as the algorithm) that can be run and will print some graphs showing reward over time. Neither is very complex, but if you are just getting started they may help give you some ideas. I hope this helped; let me know if you have any more, or more specific, questions.