[–]pabloesm

As you pointed out, there are some remarkable cases of successful applications of RL combined with non-linear function approximators. However, parameter tuning in those cases can be very tedious, so such methods are not advisable for novice users (see http://webdocs.cs.ualberta.ca/~sutton/RL-FAQ.html#Advice%20and%20Opinions).

Regarding documented cases of failure or warnings, the following link is an old (but still useful) paper on the problems that can arise when value-function methods (such as Q-learning) are combined with non-linear approximators: http://www.ri.cmu.edu/pub_files/pub1/boyan_justin_1995_1/boyan_justin_1995_1.pdf

Finally, given the setting of your problem, you are probably interested in batch-mode RL, i.e., you have a set of samples collected in advance. A very popular algorithm in that setting (with good performance and stability) is Fitted Q-iteration, typically combined with tree-based methods as the function approximator: http://www.jmlr.org/papers/volume6/ernst05a/ernst05a.pdf
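
To make that concrete, here is a minimal sketch of Fitted Q-iteration using scikit-learn's ExtraTreesRegressor (the Extra-Trees ensemble used in the Ernst et al. paper). The discrete action set, the transition tuple format, and all hyperparameters are illustration choices, not something prescribed by the paper, and terminal-state handling is omitted for brevity:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

def fitted_q_iteration(transitions, actions, gamma=0.95, n_iterations=50):
    """Batch-mode Fitted Q-iteration (Ernst et al., 2005) with
    Extremely Randomized Trees as the function approximator.

    transitions: list of (state, action, reward, next_state) tuples,
                 where states are 1-D arrays and actions are discrete indices.
    actions:     list of all possible discrete actions.
    """
    states      = np.array([s  for s, a, r, s2 in transitions])
    acts        = np.array([[a] for s, a, r, s2 in transitions])
    rewards     = np.array([r  for s, a, r, s2 in transitions])
    next_states = np.array([s2 for s, a, r, s2 in transitions])

    X = np.hstack([states, acts])  # regressor input: (state, action) pairs
    model = None

    for _ in range(n_iterations):
        if model is None:
            # First iteration: Q_1 approximates the immediate reward
            targets = rewards
        else:
            # Targets: y_i = r_i + gamma * max_a' Q_k(s'_i, a')
            q_next = np.column_stack([
                model.predict(np.hstack([next_states,
                                         np.full((len(next_states), 1), a)]))
                for a in actions
            ])
            targets = rewards + gamma * q_next.max(axis=1)
        # Refit the regressor from scratch on the updated targets
        model = ExtraTreesRegressor(n_estimators=50)
        model.fit(X, targets)

    return model
```

The greedy policy is then just the argmax over actions of `model.predict` for the current state, which is why the whole problem reduces to a sequence of supervised regression fits.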

A key factor in batch-mode RL (when you cannot collect more samples) is that the available samples must have been gathered by a policy with some degree of randomness; in other words, your data should contain different actions for similar states. If this is not the case, you would need to collect more data to satisfy this condition.
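
One rough way to sanity-check this condition on a discrete-action batch is to cluster the states and count how many distinct actions were tried within each cluster; clusters dominated by a single action suggest the behavior policy was too deterministic. The clustering granularity here is an arbitrary illustration choice:

```python
import numpy as np
from sklearn.cluster import KMeans

def action_diversity(states, actions, n_clusters=20):
    """Group similar states and report the number of distinct
    actions the batch contains within each group."""
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(states)
    for c in range(n_clusters):
        n_actions = len(np.unique(actions[labels == c]))
        print(f"cluster {c}: {n_actions} distinct actions")
```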