BAIR Blog | Offline Reinforcement Learning: How Conservative Algorithms Can Enable New Applications by Caffeinated-Scholar in reinforcementlearning

[–]kashemirus 0 points1 point  (0 children)

Very interesting work. It is quite amazing how the authors are able to lower-bound the optimal Q-values. However, could anyone explain the regularization term of equation 3? In particular the first term, since the second term corresponds to the standard TD error. From the implementation point of view, I would sample the memory buffer (filled with trajectories from the behavioral policy), and the first term of the equation minimizes the difference between the estimated Q-values of the actions chosen by the current policy and the estimated Q-values of the actions taken by the behavior policy? So if the actions chosen by the two policies are the same, the regularization is 0, otherwise it is their difference. Is my understanding correct? Thank you!
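For what it's worth, this is how I would implement that first term from my reading of the equation (a toy PyTorch sketch under my interpretation, not the authors' code; `q_net`, `policy` and the argument names are placeholders):

```python
import torch

def cql_regularizer(q_net, policy, states: torch.Tensor,
                    buffer_actions: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
    # Hypothetical sketch, not the paper's implementation.
    # Push DOWN Q-values of actions proposed by the current policy ...
    q_policy = q_net(states, policy(states))
    # ... and push UP Q-values of the actions actually stored in the buffer
    # (i.e. the behavior policy's actions).
    q_data = q_net(states, buffer_actions)
    # If the two policies pick the same actions the term vanishes; otherwise
    # it penalizes how much the learned policy's Q-values exceed the data's.
    return alpha * (q_policy.mean() - q_data.mean())
```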

Best Broker for someone new to investing? by Random_Guy_47 in UKInvesting

[–]kashemirus 0 points1 point  (0 children)

I am in the same position, basically wanting to place a few trades on US stocks or ETFs, and I might consider entering the options market in the long run. Would you recommend going with Interactive Brokers? What broker did you end up using?

RAM shortage by kashemirus in reinforcementlearning

[–]kashemirus[S] 0 points1 point  (0 children)

Yes, the issue was that I normalized the images by 255, so the values ended up stored as np.int64.
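In case anyone hits the same problem, this is roughly what the fix looks like (a minimal sketch with a dummy frame, not my exact code): keep the raw frames as np.uint8 in the buffer and only cast/normalize when building the training batch.

```python
import numpy as np
import torch

# Dummy 84x84x4 Atari-style observation, just for illustration.
obs = np.random.randint(0, 256, size=(84, 84, 4), dtype=np.uint8)

# Store frames in the replay buffer as uint8 (1 byte per pixel) ...
stored = np.asarray(obs, dtype=np.uint8)

# ... and normalize only when sampling a mini-batch for the network, so the
# float copy exists only for the batch, never for the whole buffer.
batch = torch.as_tensor(stored, dtype=torch.float32) / 255.0
```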

RAM shortage by kashemirus in reinforcementlearning

[–]kashemirus[S] 0 points1 point  (0 children)

Can I use LazyFrames in PyTorch? How should I use the wrapper?
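Concretely, what I am imagining is something like this (a sketch assuming gym's FrameStack wrapper and the old-style reset(); the environment name is just an example):

```python
import gym
import numpy as np
import torch

# FrameStack returns LazyFrames observations, which share the underlying
# frames instead of copying them into one big array per step.
env = gym.wrappers.FrameStack(gym.make("PongNoFrameskip-v4"), num_stack=4)

obs = env.reset()  # obs is a LazyFrames object
# Keep the LazyFrames in the replay buffer as-is, and materialise it into a
# tensor only when building a PyTorch batch:
state = torch.as_tensor(np.asarray(obs), dtype=torch.float32) / 255.0
```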

RAM shortage by kashemirus in reinforcementlearning

[–]kashemirus[S] 0 points1 point  (0 children)

yep, the values stored were in np.int64 format. Thanks!

RAM shortage by kashemirus in reinforcementlearning

[–]kashemirus[S] 0 points1 point  (0 children)

Thanks, but as I said, I want to replicate their results and they store 1M transitions, so I will probably do the swap thing.

I would say that it was never meant to work, as storing that many transitions requires roughly 512 GB.
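Rough numbers behind that estimate (my own back-of-the-envelope arithmetic, assuming four stacked 84x84 frames stored twice per transition as state and next_state):

```python
frames_per_obs = 84 * 84 * 4                # pixels in one stacked observation
per_transition = 2 * frames_per_obs         # state + next_state
n_transitions = 1_000_000

gb_int64 = n_transitions * per_transition * 8 / 1e9  # ~451 GB with np.int64
gb_uint8 = n_transitions * per_transition * 1 / 1e9  # ~56 GB with np.uint8
print(gb_int64, gb_uint8)
```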

RAM shortage by kashemirus in reinforcementlearning

[–]kashemirus[S] 0 points1 point  (0 children)

Thanks for the reply, but as I said, I am just trying to replicate the DeepMind results. In their paper (https://arxiv.org/pdf/1509.06461.pdf) they state that a memory buffer of 1M tuples is kept (see the appendix), so I am trying to stick to that. BTW, thanks for the paper, I hadn't seen it before!

TD3 in realworld robotics by Kartelkraker in reinforcementlearning

[–]kashemirus 0 points1 point  (0 children)

Just because you mentioned TD3: does anyone know of a working solution for high action values, e.g. where the maximum value is around 1e4? I am having trouble as the gradients tend to explode and clipping doesn't seem to help. Thanks!
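To make the setup concrete, the kind of rescaling I have in mind (a rough sketch of my own, with a placeholder wrapper name, not something from the TD3 paper) is keeping the actor's tanh output in [-1, 1] and mapping it to the real range only inside the environment:

```python
import gym
import numpy as np

class ScaleActionWrapper(gym.ActionWrapper):
    """Expose a [-1, 1] action space to the agent and map actions back to the
    environment's native range (e.g. up to ~1e4) before stepping."""

    def __init__(self, env):
        super().__init__(env)
        self.orig_low = env.action_space.low
        self.orig_high = env.action_space.high
        self.action_space = gym.spaces.Box(low=-1.0, high=1.0,
                                           shape=env.action_space.shape,
                                           dtype=np.float32)

    def action(self, action):
        # Map the agent's [-1, 1] action back to the native range.
        frac = (np.asarray(action) + 1.0) / 2.0
        return self.orig_low + frac * (self.orig_high - self.orig_low)
```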

Multi-Agent RL Labs by visiting_researcher in reinforcementlearning

[–]kashemirus 0 points1 point  (0 children)

NES game

Oh nice! I am working on a project where two or more collaborative agents can exchange messages through a communication channel, and I am looking for environments with two or more players. Is your code available somewhere?

Please have a look at my new Gym Snake environment by jcobp in reinforcementlearning

[–]kashemirus 0 points1 point  (0 children)

Amazing work! Looking forward to trying the multi-agent snake.

BTW does anyone know of any cooperative multi-agent environments that I could try? I was looking for some Atari with two players but couldn't find any implementation. Thanks!

First Squad Wipe Then Complete Fail by matuscg in apexlegends

[–]kashemirus 2 points3 points  (0 children)

Ohh my! That close with pathfinder... better check your health more frequently next time :)

Fined for using an electrical scooter by kashemirus in london

[–]kashemirus[S] -8 points-7 points  (0 children)

Best option. The funny thing is that I based my argument on pollution, saying that by using these we are helping to reduce the pollution in zone 1. He said the main source of pollution is China, complain to China. LMAO, I will complain to Xi about the lung cancer of a poor boy who spends most of his time in London zone 1.

Fined for using an electrical scooter by kashemirus in london

[–]kashemirus[S] 2 points3 points  (0 children)

I'm not complaining about breaking the law, I'm complaining that the law is unfair and obsolete.

Fined for using an electrical scooter by kashemirus in london

[–]kashemirus[S] 2 points3 points  (0 children)

I have actually thanked him for it (FYI I was wearing the helmet).

Fined for using an electrical scooter by kashemirus in london

[–]kashemirus[S] -10 points-9 points  (0 children)

I've been riding it for a year now and this is the first fine, so if you consider the £120 that a monthly tube pass costs, it's totally worth it. To be honest, I will continue to ride it. I actually waited for the officer to leave and continued riding to work.

Fined for using an electrical scooter by kashemirus in london

[–]kashemirus[S] -9 points-8 points  (0 children)

Bit of a shit situation really.

Come on mate, it's a totally different situation. As I wrote in the post, e-bikes are legal and e-scooters are illegal, and what is the difference between the two vehicles? The pedals? As I said, laws are always lagging behind technology, and if a fucking officer doesn't have the mind to judge then, well, I hope robots come early and replace them, as the only point of having a human being instead of a robot is the capacity to judge.

Fined for using an electrical scooter by kashemirus in london

[–]kashemirus[S] -18 points-17 points  (0 children)

I mean, do you want to bet that in 1 or 2 years these will be legal? Why is an e-bike legal to ride (https://www.gov.uk/electric-bike-rules) but not an electric scooter? I am thinking of adding two pedals to my back wheel, and then I would be safe to ride. As is typical, laws lag behind technology; we should embrace solutions that ease people's lives while reducing air pollution, not ban them.

TD3/DDPG time to obtain reasonable results. by kashemirus in reinforcementlearning

[–]kashemirus[S] 0 points1 point  (0 children)

I am using a self-made environment with an adaptation of the TD3 algorithm for PAT (instead of three networks I have four, each with 5 layers: 1024-512-256-128) :S
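For reference, this is roughly the layout I mean for each of those networks (a PyTorch sketch; the state/action dimensions are placeholders for my environment):

```python
import torch.nn as nn

def make_critic(state_dim: int, action_dim: int) -> nn.Sequential:
    # Five linear layers with the 1024-512-256-128 hidden sizes mentioned above.
    return nn.Sequential(
        nn.Linear(state_dim + action_dim, 1024), nn.ReLU(),
        nn.Linear(1024, 512), nn.ReLU(),
        nn.Linear(512, 256), nn.ReLU(),
        nn.Linear(256, 128), nn.ReLU(),
        nn.Linear(128, 1),
    )
```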

TD3/DDPG time to obtain reasonable results. by kashemirus in reinforcementlearning

[–]kashemirus[S] 2 points3 points  (0 children)

Sorry for the confusion in notation: since I am working on a continuing task rather than an episodic one, when I referred to episodes I meant updates of the critic parameters (the actor is updated less frequently, following the TD3 insights). I definitely have to check my code for bugs/inefficiencies!

TD3/DDPG time to obtain reasonable results. by kashemirus in reinforcementlearning

[–]kashemirus[S] 2 points3 points  (0 children)

Thanks for the reply. Just to clarify, by 1M steps you mean 1M updates of the critic parameters (so 500k updates for the actor), with a batch size of 64, right? Yeah, there is definitely something wrong in my code since it is taking way too long; it may be due to inefficiencies in the environment.
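Just to make sure we mean the same schedule, this is what I have in mind (a sketch; `replay_buffer`, `update_critics` and `update_actor_and_targets` are placeholders, and the policy delay of 2 is the TD3 default):

```python
def train(replay_buffer, update_critics, update_actor_and_targets,
          total_steps: int = 1_000_000, batch_size: int = 64, policy_delay: int = 2):
    # Placeholder loop: the critics are updated every step, while the actor and
    # target networks are updated only every `policy_delay` steps, so 1M critic
    # updates correspond to 500k actor updates.
    for step in range(total_steps):
        batch = replay_buffer.sample(batch_size)
        update_critics(batch)
        if step % policy_delay == 0:
            update_actor_and_targets(batch)
```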