I'm struggling to figure out how I want to do this, so I hope someone here can offer some guidance.
Scenario: I have a 10-character string, let's call it the DNA, made up of the following characters:
F
-
+
[
]
X
For example, DNA = ['F', 'F', '+', '+', '-', '[', 'X', '-', ']', '-']
Now these DNA strings get converted to physical representations, from which I can get a fitness or reward value. So an RL flowchart for this scenario would look like this:
P.S. The maximum fitness is not known or specified.
Step 1: Get random DNA string
Step 2: Compute fitness
Step 3: Get another random DNA string
Step 4: Compute fitness
Step 5: Compute gradient and see which way is up
Step 6: Train ML algorithm to generate better and better DNA strings until fitness no longer increases
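The six steps can be approximated without an explicit gradient by comparing candidates directly, i.e. a simple (1+1) hill climber. This is only a sketch: the real fitness comes from the physical representation, so the `fitness` below (counting matches to the target string) is a toy stand-in I'm assuming for illustration.

```python
import random

ALPHABET = ['F', '-', '+', '[', ']', 'X']
TARGET = ['F', 'X', 'X', 'X', 'X', 'F', 'X', 'X', 'X', 'X']

def fitness(dna):
    # Toy stand-in for the real (physical) fitness: matches to the target.
    return sum(a == b for a, b in zip(dna, TARGET))

random.seed(0)
# Steps 1-2: a random DNA string and its fitness
best = [random.choice(ALPHABET) for _ in range(10)]
best_fit = fitness(best)

for _ in range(1000):
    # Steps 3-4: another candidate (here a mutation of the current best)
    cand = [random.choice(ALPHABET) if random.random() < 0.2 else c for c in best]
    cand_fit = fitness(cand)
    # Steps 5-6: keep whichever is "up"; repeat until fitness stops improving
    if cand_fit > best_fit:
        best, best_fit = cand, cand_fit
```
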
For clarity's sake, the best DNA string, i.e. the one that will return the highest fitness, for my purposes now is:
['F', 'X', 'X', 'X', 'X', 'F', 'X', 'X', 'X', 'X']
How can I train an ML algorithm to learn this and output this DNA string?
I'm trying to wrap my brain around Policy Gradient methods, but what will the input to the ML algorithm be? There are no states like in the OpenAI Gym examples.
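One way around the "no states" issue is to treat the whole problem as a contextless bandit: the policy is just one categorical distribution per position, and REINFORCE nudges those distributions toward higher-fitness samples. A minimal sketch, again assuming a toy fitness (matches to the target string) in place of the real physical one:

```python
import numpy as np

ALPHABET = ['F', '-', '+', '[', ']', 'X']
TARGET = ['F', 'X', 'X', 'X', 'X', 'F', 'X', 'X', 'X', 'X']

def fitness(dna):
    # Toy stand-in for the real fitness: matches to the target string.
    return sum(a == b for a, b in zip(dna, TARGET))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
logits = np.zeros((10, len(ALPHABET)))  # one categorical distribution per position
baseline = 0.0

for _ in range(2000):
    probs = softmax(logits)
    idx = np.array([rng.choice(len(ALPHABET), p=p) for p in probs])
    r = fitness([ALPHABET[i] for i in idx])
    # REINFORCE: grad of log p(sampled) is (one-hot - probs); scale by advantage
    grad = -probs
    grad[np.arange(10), idx] += 1.0
    logits += 0.1 * (r - baseline) * grad
    baseline += 0.05 * (r - baseline)  # running-mean baseline reduces variance

best = [ALPHABET[i] for i in logits.argmax(axis=1)]
```

Because the policy factorizes over positions, each position effectively solves its own 6-armed bandit, with the other positions contributing reward noise that the baseline soaks up.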
EDIT: Final goal - an algorithm that learns to generate DNA strings with higher and higher fitness values. This has to happen without any human supervision, i.e. NOT supervised learning but reinforcement learning.
Akin to a GA that will evolve better and better DNA strings
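That GA-style evolution can be sketched directly with a small population, truncation selection, and random mutation. As before, the `fitness` here (matches to the target string) is an assumed toy stand-in for the real physical fitness:

```python
import random

ALPHABET = ['F', '-', '+', '[', ']', 'X']
TARGET = ['F', 'X', 'X', 'X', 'X', 'F', 'X', 'X', 'X', 'X']

def fitness(dna):
    # Toy stand-in for the real fitness: matches to the target string.
    return sum(a == b for a, b in zip(dna, TARGET))

def mutate(dna, rate=0.2):
    # Each position is resampled from the alphabet with probability `rate`.
    return [random.choice(ALPHABET) if random.random() < rate else c for c in dna]

random.seed(0)
pop = [[random.choice(ALPHABET) for _ in range(10)] for _ in range(20)]
for _ in range(100):
    pop.sort(key=fitness, reverse=True)
    parents = pop[:5]                                        # truncation selection
    pop = parents + [mutate(random.choice(parents)) for _ in range(15)]

best = max(pop, key=fitness)
```

Keeping the top parents unchanged (elitism) guarantees the best fitness never decreases between generations.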
Assuming that the problem is to mutate a given string into another string with a higher fitness value, the Markov Decision Process can be modeled as:
State: the current DNA string
Action: choose the characters of the next string
Reward: fitness(next_state) - fitness(state) + similarity(state, next_state)
OR simply fitness(next_state) - fitness(state)
You could start with Q-learning, with a discrete action space of dimension 10, each dimension having 6 choices: (F, -, +, [, ], X).