machine-learning, artificial-intelligence

How can I learn a reward function?


I am trying to go a bit outside the box in developing an AI for a game similar to Othello.

I'm looking at a number of different techniques and algorithms to determine optimal moves, such as Negascout and MTD(f). However, they all require a good evaluation function.

I've come up with a bunch of possible indicators {A_0, ..., A_n} to be used in a function

G(state) = p_0*A_0 + p_1*A_1 + ... + p_n*A_n

And I want to somehow find good values for p_0 through p_n.
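
For concreteness, here is a rough sketch of what I mean (the indicator functions and the state API below are just placeholders I made up for illustration):

    def evaluate(state, weights, indicators):
        # G(state) = p_0*A_0(state) + p_1*A_1(state) + ... + p_n*A_n(state)
        return sum(p * A(state) for p, A in zip(weights, indicators))

    # Placeholder indicators for an Othello-like game (hypothetical state methods).
    indicators = [
        lambda s: s.piece_difference(),  # A_0: my discs minus my opponent's
        lambda s: s.corners_owned(),     # A_1: corners I control
        lambda s: s.mobility(),          # A_2: number of legal moves available to me
    ]
    weights = [1.0, 1.0, 1.0]  # p_0..p_2 -- these are the values I want to learn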

One suggestion was to use machine learning to generate the parameters for the function, but in my reading I found that algorithms such as Q-learning all require me to already have a reward function.

Additionally, in reading about TD(lambda) I noticed that it doesn't even need the indicators to be hand-coded. What sort of reward function would it use to learn from?

What am I missing in my understanding?


Solution

  • I think you're confusing the reward function with what is often called the Q function: the estimate of the maximal summed (and possibly discounted) reward obtainable from a state.

    To elaborate: there exists a reward function R defined over (s, a, s') triples, which tells me the reward I receive when, in state s, I choose action a and end up in s'. To decide which action to take, I want estimates of some quality function Q(s, a), which tells me the expected discounted future reward of taking action a in state s. The expectation is there because, in the general case, the transition function may be probabilistic, so the same action in the same state may not always lead to the same successor. This Q function sums the values of R over every (s, a, s') triple on the trajectory from the current state, possibly applying a discount factor to weight more distant rewards less, and possibly employing a horizon.

    So in summary: R is given, and the reinforcement learning problem is to come up with estimates of Q. Q can be approximated by a linear regression on a bunch of features of s and a, just like the form you give above. Critically, because you observe the trajectory from s, you know the true value of the discounted future reward from s for that trajectory, so you have a correct answer to fit the regression model against (see the sketch below). Learning the reward function is an altogether different problem, and it cannot be solved via Q-learning, temporal difference, etc.
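
    To make the distinction concrete, here is a minimal sketch, assuming a simple win/loss reward and hand-crafted features; the state and feature methods are illustrative placeholders, not any particular library's API:

        import numpy as np

        GAMMA = 0.99  # discount factor

        def reward(s, a, s_next):
            # R(s, a, s'): given by the game, not learned. For an Othello-like
            # game a natural choice is the outcome reached at the terminal state.
            if s_next.is_terminal():
                return 1.0 if s_next.winner() == s.player_to_move() else -1.0
            return 0.0

        def discounted_return(trajectory):
            # Sum of discounted rewards along one observed list of (s, a, s')
            # triples; this is the "correct answer" that estimates of Q are fit to.
            return sum((GAMMA ** t) * reward(s, a, s_next)
                       for t, (s, a, s_next) in enumerate(trajectory))

        def fit_q_weights(trajectories, features):
            # Least-squares fit of Q(s, a) ~ w . phi(s, a) against the returns
            # actually observed from each visited (s, a) onward.
            X, y = [], []
            for traj in trajectories:
                for t in range(len(traj)):
                    s, a, _ = traj[t]
                    X.append(features(s, a))
                    y.append(discounted_return(traj[t:]))  # return from time t on
            X, y = np.asarray(X), np.asarray(y)
            w, *_ = np.linalg.lstsq(X, y, rcond=None)
            return w

    Replacing the observed return with a bootstrapped target gives TD- or Q-learning-style updates instead, but in every variant the reward function itself stays fixed; only the estimate of Q is learned.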