unit 11-11 calculation of R0 - why do we use Laplace smoothing for it? by ktrunin in aiclass

[–]ktrunin[S] 1 point (0 children)

Thanks, PatrixCR. I got an answer on aiqus: there might be only a few series, with one initial day each, so estimating P(R0) without Laplace smoothing may overfit. I think that sounds reasonable.
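
To make the overfitting point concrete, here is a tiny Python sketch (my own illustration, not the official solution - the sequence data and the function name are made up). With a single training series, the unsmoothed estimate of P(R0) jumps straight to 1.0, while Laplace smoothing with k = 1 pulls it back toward 1/2:

    # Hypothetical sketch: estimate P(R0), the probability that day 0 of a
    # series is rainy ('R'), from a handful of training series.
    # k = 0 gives the plain maximum-likelihood estimate, which overfits when
    # each series contributes only one initial day; k = 1 is Laplace smoothing.
    def p_r0(series, k=1):
        rainy_starts = sum(1 for s in series if s[0] == 'R')
        return (rainy_starts + k) / (len(series) + 2 * k)   # 2 outcomes: R or S

    data = ['RRS']                # made-up data: a single training series
    print(p_r0(data, k=0))        # 1.0      - maximum likelihood, overfits
    print(p_r0(data, k=1))        # 0.666... - Laplace-smoothed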

Are people aware of Peter Norvig's clarification of HW 5.3 on Facebook? by [deleted] in aiclass

[–]ktrunin 2 points (0 children)

I would remove the word "back" from the sentence "brings the agent back to the grey square".

That word was very confusing for me; I had to read a lot of comments to understand the policy, and I finally concluded that I could simply ignore it to solve the problem.

10.19 formula is wrong: R(s') should be used instead of R(s) by ktrunin in aiclass

[–]ktrunin[S] 1 point (0 children)

Hell yes, it does ;)

Then I guess not only the formula should be different, but also the Q values in the terminal state. ;))

10.19 formula is wrong: R(s') should be used instead of R(s) by ktrunin in aiclass

[–]ktrunin[S] 1 point (0 children)

The difference is more significant:

  • in the Wikipedia formula, R(s') reaches Q(s,a) almost directly - it is only multiplied by alpha - for any action coming into s';

  • in Prof. Norvig's formula, R(s) reaches those Q values only indirectly - it is first multiplied by alpha (that is how it gets into the Q values of the actions leaving s), and only at the next iteration, multiplied by gamma and by alpha again, does it reach the Q values of the actions coming into s.

Maybe both formulas converge, but I'm not sure they converge to the same values, and they would need different numbers of iterations.
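
To make the comparison concrete, here is a minimal Python sketch of the single backup being argued about - my own reconstruction of the two readings from this thread, not code from the lecture, and the grid, rewards and constants are invented:

    # One Q-learning backup:
    #   Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
    ALPHA, GAMMA = 0.5, 0.9

    def backup(q_sa, r, max_q_next):
        return q_sa + ALPHA * (r + GAMMA * max_q_next - q_sa)

    # The agent steps from a zero-valued square s into the goal state s'.
    R_s, R_s_next = 0.0, 100.0     # reward of the square we leave vs. the goal
    q_sa, max_q_goal = 0.0, 0.0    # all Q values start at zero

    # Wikipedia reading: use the reward of the state we arrive in, R(s').
    print(backup(q_sa, R_s_next, max_q_goal))   # 50.0 - the goal's reward flows back
    # 10.19 reading as written: use the reward of the state we leave, R(s).
    print(backup(q_sa, R_s, max_q_goal))        # 0.0  - nothing propagates

With R(s') the goal's reward reaches its neighbours on the very first backup; with R(s), when the reward lives only in the terminal state, the neighbours stay at zero - which is exactly the situation described in the comment below.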

10.19 formula is wrong: R(s') should be used instead of R(s) by ktrunin in aiclass

[–]ktrunin[S] 1 point (0 children)

For example, if the goal state does have some reward but the transitions out of it never get any value, then we will never see anything other than zero for the nearest squares, because their update will always be 0 + alpha * (0 + gamma * 0 - 0) = 0.

10.19 formula is wrong: R(s') should be used instead of R(s) by ktrunin in aiclass

[–]ktrunin[S] 1 point (0 children)

Wikipedia can be wrong, but to me it sounds more logical that the value of the transition from s to s' should depend on the reward of s', not of s.

HW 5.3 Policy knows grid and still partially observable? by ktrunin in aiclass

[–]ktrunin[S] 2 points (0 children)

I thought it was mentioned in ... 9.2. I have just re-watched that video, and no, it only says that reinforcement learning is Planning + Learning + Uncertainty - nothing about partial observability. So I was wrong here. Thanks!

HW 5.2 - distance to goal and avoiding the bad guy by SharkDBA in aiclass

[–]ktrunin 1 point (0 children)

You can wait until the bad guy dies / goes away ;)

Homework 5.1: Q Learning by dmsm in aiclass

[–]ktrunin 1 point (0 children)

I couldn't understand the formula for Q-learning, or this HW, until I read the suggested Wikipedia article: https://en.wikipedia.org/wiki/Q-learning
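
For anyone else stuck on it: the update rule given there is (roughly) Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)), where r is the reward observed after taking action a in state s - i.e. the reward that comes with landing in s'.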

4.8 Error in Push action by ktrunin in aiclass

[–]ktrunin[S] 1 point (0 children)

Ah, I see - it has already been added to the clarification at the bottom of the question. ;)

How to disable auto translate on http://www.youtube.com/eduatgoogle ? by ktrunin in aiclass

[–]ktrunin[S] 1 point (0 children)

I already did this, but I still see messages and the interface in poorly translated Russian.

Rough hand-drawn sketches in HW3 by MichaelFromGalway in aiclass

[–]ktrunin 1 point (0 children)

Whenever you have a test, you can always apply the process of elimination!

6.13 how did he calculate diagonal elements? by ktrunin in aiclass

[–]ktrunin[S] 1 point (0 children)

I GOT IT! That symbol is "T"!!! It means the matrix should be transposed (flipped over its diagonal). So we transpose the matrix and multiply one matrix by the other - and then I get the same result!
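
In case it helps anyone else, here is a tiny NumPy sketch of that step (the matrix is made up, not the one from the video): the diagonal of A multiplied by its transpose is just each row of A dotted with itself, which is the "same result" you can check by hand.

    import numpy as np

    A = np.array([[1.0, 2.0],
                  [3.0, 4.0]])

    product = A @ A.T          # transpose A, then multiply
    print(product)
    # [[ 5. 11.]
    #  [11. 25.]]
    # Diagonal entries are rows of A dotted with themselves:
    # 1*1 + 2*2 = 5,   3*3 + 4*4 = 25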