
Understanding how to train Neural Networks for control theory: Q-learning and SARSA


Hi everyone. I've been learning about reinforcement learning for a little while now, with the goal of creating an agent I could use in a game, e.g. driving a car around a track. I want to learn how to combine a neural network architecture with RL methods such as Q-learning or SARSA.

Normally, in error-backpropagation neural networks you have both an input and a given target.
For example, for the XOR pattern the input is 0 0, 0 1, 1 0, or 1 1 and the target is either 0 or 1. Since the target is given, it's easy for me to see where to plug the values into my error-backprop function. My problem now is: given only the state variables in my test problems (Mountain Car or the pendulum), how do I go about using error backpropagation?
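For reference, here is a minimal sketch of the familiar supervised case: ordinary backprop on XOR with explicit targets, written in plain numpy. The network size, learning rate, and epoch count are arbitrary illustrative choices, not anything from the original post.

```python
# Minimal sketch: supervised backprop on XOR with known targets (illustrative only).
import numpy as np

rng = np.random.default_rng(0)

# XOR patterns: inputs and their given targets.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# 2 inputs -> 4 hidden units -> 1 output (sizes chosen arbitrarily).
W1 = rng.normal(scale=0.5, size=(2, 4)); b1 = np.zeros(4)
W2 = rng.normal(scale=0.5, size=(4, 1)); b2 = np.zeros(1)

lr = 0.5
for epoch in range(10000):
    # Forward pass.
    h = sigmoid(X @ W1 + b1)          # hidden activations
    y = sigmoid(h @ W2 + b2)          # network output

    # The target is given, so the error is simply (target - output).
    err = T - y
    delta_out = err * y * (1 - y)                 # output-layer delta
    delta_hid = (delta_out @ W2.T) * h * (1 - h)  # hidden-layer delta

    # Gradient-descent weight updates.
    W2 += lr * h.T @ delta_out; b2 += lr * delta_out.sum(axis=0)
    W1 += lr * X.T @ delta_hid; b1 += lr * delta_hid.sum(axis=0)

print(np.round(y, 3))  # should approach [0, 1, 1, 0]
```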

Since I first want to build an agent that solves Mountain Car as a test, is this the right set of steps? (A rough code sketch of these steps follows the list.)

S = [-0.5; 0] as the initial state (the input to my neural network)

  1. Create a network (2, X hidden units, 3) -> 2 inputs (position and velocity) and either 1 output or 3 outputs corresponding to the actions, with a sigmoid (tanh) hidden activation function and a purelin (linear) output.
  2. Feed the state values for position and velocity forward through the network and get 3 Q-values as output (3 outputs because that's how many actions I have).
  3. Select an action A using e-greedy: either a random action or the one with the best Q-value for this state.
  4. Execute action A in the environment and receive the new state S' and a reward.
  5. Run S' through the neural network and obtain the Q(S') values.
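Here is a minimal sketch of steps 1-5 in numpy, assuming a 2-X-3 network with a tanh hidden layer and linear (purelin) outputs. The environment object `env`, its `step` method, the hidden-layer size, and epsilon are hypothetical placeholders, not from the original post.

```python
# Sketch of steps 1-5: feed-forward Q-values and e-greedy action selection (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
n_hidden = 20                      # the "X" hidden units, an arbitrary choice
W1 = rng.normal(scale=0.1, size=(2, n_hidden)); b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.1, size=(n_hidden, 3)); b2 = np.zeros(3)

def q_values(state):
    """Step 2: feed [position, velocity] forward -> 3 Q-values, one per action."""
    h = np.tanh(state @ W1 + b1)   # tanh hidden layer
    return h @ W2 + b2             # purelin (linear) output layer

def epsilon_greedy(state, epsilon=0.1):
    """Step 3: random action with probability epsilon, otherwise the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(3))
    return int(np.argmax(q_values(state)))

s = np.array([-0.5, 0.0])          # initial Mountain Car state
a = epsilon_greedy(s)              # step 3
# s_next, reward = env.step(a)     # step 4 (env is a hypothetical environment object)
# q_next = q_values(s_next)        # step 5
```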

Now I guess I need to compute a target value. Given the Q-learning update Q(s,a) = Q(s,a) + alpha * [reward + gamma * max Q(s',a') - Q(s,a)],
I think my target output is calculated as QTarget = reward + gamma * max Q(s',a'), right?

So that means I now choose the max Q-value from step 5 and plug it into the QTarget equation.

Do I then calculate an error like in the original backprop algorithm?

So Error = QTarget - Q(S,A)?

And then resume normal neural-network backprop weight updates?
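If that reasoning is right, a rough sketch of the whole update might look like the following, assuming the same kind of 2-X-3 numpy network as above: the TD target serves as the supervised target for the output unit of the action actually taken, and the other outputs get zero error. All names, shapes, and hyperparameters are illustrative, not a definitive implementation.

```python
# Sketch of one Q-learning update with a neural network function approximator (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
n_hidden = 20
W1 = rng.normal(scale=0.1, size=(2, n_hidden)); b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.1, size=(n_hidden, 3)); b2 = np.zeros(3)

alpha, gamma = 0.01, 0.99          # learning rate and discount, arbitrary choices

def forward(state):
    h = np.tanh(state @ W1 + b1)   # tanh hidden layer
    return h, h @ W2 + b2          # hidden activations, linear Q-values

def q_update(s, a, reward, s_next, done=False):
    """One step: QTarget = reward + gamma * max Q(s',a'), then backprop the error."""
    global W1, W2, b1, b2
    h, q = forward(s)
    _, q_next = forward(s_next)
    q_target = reward if done else reward + gamma * np.max(q_next)

    # Error only on the output for the action actually taken;
    # the other two outputs keep their current values (zero error).
    err = np.zeros(3)
    err[a] = q_target - q[a]

    # Ordinary backprop through the linear output and tanh hidden layer.
    delta_out = err                             # linear output: derivative is 1
    delta_hid = (delta_out @ W2.T) * (1 - h**2)

    W2 += alpha * np.outer(h, delta_out); b2 += alpha * delta_out
    W1 += alpha * np.outer(s, delta_hid); b1 += alpha * delta_hid

# Example usage with made-up transition values:
s  = np.array([-0.5, 0.0])
s2 = np.array([-0.49, 0.01])
q_update(s, a=2, reward=-1.0, s_next=s2)
```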

Thanks, Sevren

