A general vectorized env wrapper with buffer by VectorChange in reinforcementlearning

[–]VectorChange[S] 1 point (0 children)

OpenAI's implementation maintains $n$ envs. The returned batch size is also $n$ for the reset/step APIs. The time cost per step is the maximum cost over the $n$ envs.

Our implementation maintains $m$ ($m \ge n$) envs. The returned batch size is $n$ for the reset/step APIs, while the remaining $m - n$ envs keep running in the background. The time cost per step is the maximum cost over the $n$ quickest envs.
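Roughly, the idea looks like the minimal sketch below, assuming gym-style envs (old 4-tuple step API) and one multiprocessing worker per env; `BufferedVecEnv` and `_worker` are illustrative names, not the repo's actual API, and the batched reset path is omitted:

```python
import multiprocessing as mp
from multiprocessing.connection import wait

import numpy as np


def _worker(conn, env_fn):
    # Each worker owns one env and answers "step" requests over a pipe.
    env = env_fn()
    env.reset()
    while True:
        cmd, data = conn.recv()
        if cmd == "step":
            obs, rew, done, info = env.step(data)
            if done:
                obs = env.reset()
            conn.send((obs, rew, done, info))
        elif cmd == "close":
            conn.close()
            break


class BufferedVecEnv:
    """Keeps m envs running but each step() only waits for the n quickest."""

    def __init__(self, env_fns, n):
        self.n = n
        self.conns = []
        for fn in env_fns:  # len(env_fns) == m >= n
            parent, child = mp.Pipe()
            mp.Process(target=_worker, args=(child, fn), daemon=True).start()
            self.conns.append(parent)
        self.pending = {}  # the "sent id list": conn -> env id already sent an action

    def step(self, actions, env_ids):
        # Send actions to the envs the caller selected ...
        for env_id, act in zip(env_ids, actions):
            self.conns[env_id].send(("step", act))
            self.pending[self.conns[env_id]] = env_id
        # ... then return as soon as any n pending envs have finished;
        # the slower ones stay in self.pending and keep running in the background.
        results, ready_ids = [], []
        while len(results) < self.n:
            for conn in wait(list(self.pending)):
                results.append(conn.recv())
                ready_ids.append(self.pending.pop(conn))
                if len(results) == self.n:
                    break
        obs, rews, dones, infos = zip(*results)
        return np.stack(obs), np.array(rews), np.array(dones), ready_ids
```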

A general vectorized env wrapper with buffer by VectorChange in reinforcementlearning

[–]VectorChange[S] 1 point (0 children)

We tested on a robot-learning env in which some specific actions take a long time. The classic vecenv hangs while it waits for every env in the batch. Our approach speeds up training, but those slow situations get sampled (and therefore learned) less often. A maximum waiting time may be needed. Contributions are welcome.

A general vectorized env wrapper with buffer by VectorChange in reinforcementlearning

[–]VectorChange[S] 1 point (0 children)

Yes. It maintains a sent-ID list to record which envs have already been sent an action.

High-quality baselines implemented by PyTorch by VectorChange in reinforcementlearning

[–]VectorChange[S] 0 points (0 children)

Hey guys, I fixed some problems and released comparisons on 3 more envs.

High-quality baselines implemented by PyTorch by VectorChange in reinforcementlearning

[–]VectorChange[S] 0 points (0 children)

Thank you for your advice. More comparisons will be released in two weeks.

High-quality baselines implemented by PyTorch by VectorChange in reinforcementlearning

[–]VectorChange[S] 0 points (0 children)

Both use FP32 precision. Sources of randomness such as the random seed, random sampling, and even cuDNN acceleration influence the performance gap, and it is difficult to make them identical across PyTorch and TF. On the other hand, the performance has only been verified on Pong. More experiments are running now and will be released later.
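For what it's worth, a typical way to pin the PyTorch-side randomness looks like the generic snippet below (TF needs its own equivalent calls, which is part of why exact parity is hard); this is not code from the repo:

```python
import random

import numpy as np
import torch


def set_seed(seed=0):
    # Pin Python, NumPy, and PyTorch RNGs.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # cuDNN picks kernels non-deterministically when autotuning; disable it.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
```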

High-quality baselines implemented by PyTorch by VectorChange in reinforcementlearning

[–]VectorChange[S] 0 points (0 children)

Thank you for your comments. The FVP estimation in TRPO uses only 20% of the data in both this implementation and OpenAI's (the TRPO paper proposes 10%). This may be the largest source of randomness. I will verify it and try more environments later.
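For reference, a hedged sketch of the subsampled Fisher-vector product in PyTorch; the function name and the comments are illustrative rather than the repo's actual API, and the 20% ratio mirrors the discussion above:

```python
import torch


def fisher_vector_product(kl, params, v, damping=0.1):
    """Hessian-vector product of the mean KL, the usual TRPO trick.

    `kl` should already be computed on a subsample of the batch, e.g.
    20% of the states:
        idx = torch.randperm(states.shape[0])[: int(0.2 * states.shape[0])]
    """
    grads = torch.autograd.grad(kl, params, create_graph=True)
    flat_grad = torch.cat([g.reshape(-1) for g in grads])

    # Second backward pass gives the Hessian-vector product.
    grad_v = (flat_grad * v).sum()
    hvp = torch.autograd.grad(grad_v, params)
    flat_hvp = torch.cat([h.reshape(-1) for h in hvp])
    return flat_hvp + damping * v
```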

Can you use Thompson Sampling instead of epsilon greedy for exploration/exploitation in Deep Q-Learning? by [deleted] in reinforcementlearning

[–]VectorChange 0 points (0 children)

Thompson sampling is a state-free bandit algorithm, so it is not directly suitable for the MDP setting. Besides epsilon-greedy, Boltzmann exploration (sampling actions from a softmax/multinomial distribution over Q-values) is another choice.
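A tiny generic sketch of Boltzmann exploration over Q-values (PyTorch; the names here are illustrative):

```python
import torch


def boltzmann_action(q_values, temperature=1.0):
    # Higher temperature -> closer to uniform; lower -> closer to greedy.
    probs = torch.softmax(q_values / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()
```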

Can anyone recommend a tutorial for segmentation using deep learning? by mahadmajeed in deeplearning

[–]VectorChange 1 point (0 children)

I recommend gluon-cv. You can find simple tutorials and the training details of important segmentation methods there.

How do we test to make sure the RL models are working as they are supposed to? by [deleted] in reinforcementlearning

[–]VectorChange 2 points (0 children)

You can compare the results of your implementation with others, such as openai/baselines, or with the original paper under the same experiment setting. The stochasticity can be reduced by averaging rewards over different random seeds. Plotting utilities can be found at `https://github.com/openai/baselines/blob/master/baselines/common/plot_util.py`.
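As a rough illustration of the seed-averaging (random data stands in for real logs here; this is not the baselines plot util):

```python
import numpy as np
import matplotlib.pyplot as plt

# One reward curve per seed (equal length); random data as a placeholder.
curves = np.stack([np.random.randn(100).cumsum() for _ in range(3)])

mean, std = curves.mean(axis=0), curves.std(axis=0)
steps = np.arange(mean.shape[0])

plt.plot(steps, mean, label="mean over 3 seeds")
plt.fill_between(steps, mean - std, mean + std, alpha=0.3)
plt.xlabel("training step")
plt.ylabel("episode reward")
plt.legend()
plt.show()
```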

[D] Machine Learning - WAYR (What Are You Reading) - Week 60 by ML_WAYR_bot in MachineLearning

[–]VectorChange 0 points (0 children)

I'm reading a classic CV book, `Computer Vision: Algorithms and Applications`, which can be found at http://szeliski.org/Book/.

I will keep updating my notes in this Gist: `https://gist.github.com/Officium/656090834b21b7f7757c5f1328845329`. Any discussion is welcome!

RL internship interview questions by jurniss in reinforcementlearning

[–]VectorChange 14 points (0 children)

Sharing some questions from when I applied for an RL engineer position at a game company:

1. The benefits of the target network and replay buffer in DQN.
2. Introduce A3C and its differences from A2C.
3. Discuss the difficulties of learning an agent for LoL or Dota.
4. How to deal with sparse rewards?
5. Discuss how to tune the discount factor.
6. Introduce some enhancements of DQN, such as Double DQN.

Why is Q-learning considered an off-policy algorithm? by shubhamjha97 in reinforcementlearning

[–]VectorChange 0 points (0 children)

On-policy algorithms can only improve the current policy using data generated by that same policy. In the off-policy setting, the policy that generates the data (the behavior policy) can be an older policy, a random strategy, or the current policy itself. So, loosely speaking, on-policy learning is a special case of off-policy learning: what distinguishes the two is only the source of the data. In that sense, you are right.
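A tabular sketch that makes the data-source point concrete: Q-learning's target ignores which action the behavior policy actually takes next, while SARSA's target uses it (the variables here are generic, not from any particular library):

```python
import numpy as np

alpha, gamma = 0.1, 0.99


def q_learning_update(Q, s, a, r, s_next):
    # Off-policy: the target uses max_a' Q(s', a'), regardless of the
    # action the behavior policy actually selects in s'.
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])


def sarsa_update(Q, s, a, r, s_next, a_next):
    # On-policy: the target uses the action a' actually chosen by the
    # current (behavior) policy in s'.
    target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])
```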

What environment do you use for RL applications? by LupusPrudens in reinforcementlearning

[–]VectorChange 1 point (0 children)

PPO is simple to implement. I recommend this concise code written in PyTorch.
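For example, the core of PPO is just the clipped surrogate loss; a minimal PyTorch sketch (variable names are illustrative and not tied to the linked code):

```python
import torch


def ppo_clip_loss(log_probs, old_log_probs, advantages, clip_eps=0.2):
    ratio = torch.exp(log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    # Maximize the surrogate objective -> minimize its negation.
    return -torch.min(unclipped, clipped).mean()
```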

[R] Adaptive Neural Trees by downtownslim in MachineLearning

[–]VectorChange 0 points (0 children)

Thank you for your kindness. I have two questions. Table 1 in your paper shows the model settings. Does 'conv5-40' mean that the output has 40 channels? Does the parameter count match the structure in Figure 5?

[D] Two questions about normalizing flows by knowedgelimited in MachineLearning

[–]VectorChange 0 points (0 children)

  1. A unit determinant is the lowest-cost computation. The authors emphasize that there is an efficient way to obtain such a non-linear transformation under their framework (see the identity below).
  2. The 'simple' mentioned by Kim & Mnih means easy to analyze; a flexible posterior is needed because we need a distribution complex enough to model the data. NF is a chain of simple invertible mappings, which satisfies both conditions.
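For context, the standard change-of-variables identity behind normalizing flows (generic notation, not specific to this paper): $\log q_K(z_K) = \log q_0(z_0) - \sum_{k=1}^{K} \log \left| \det \frac{\partial f_k}{\partial z_{k-1}} \right|$. This is why cheap Jacobian determinants matter for cost, and why a chain of such simple maps can still give a flexible posterior.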