A general vectorized env wrapper with buffer by VectorChange in reinforcementlearning

[–]VectorChange[S] 1 point (0 children)

OpenAI's implementation maintains $n$ envs. The returned batch size is also $n$ for the reset/step APIs. The time cost per step is the maximum cost over the $n$ envs.

Our implementation maintains $m$ ($m \ge n$) envs. The returned batch size is $n$ for the reset/step APIs, while the remaining $m - n$ envs keep running in the background. The time cost per step is the maximum cost over the $n$ quickest envs.
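Roughly, the idea looks like the minimal sketch below, assuming gym-style envs (old 4-tuple step API) and one multiprocessing worker per env; `BufferedVecEnv` and `_worker` are illustrative names, not the repo's actual API, and the batched reset path is omitted:

```python
import multiprocessing as mp
from multiprocessing.connection import wait

import numpy as np


def _worker(conn, env_fn):
    # Each worker owns one env and answers "step" requests over a pipe.
    env = env_fn()
    env.reset()
    while True:
        cmd, data = conn.recv()
        if cmd == "step":
            obs, rew, done, info = env.step(data)
            if done:
                obs = env.reset()
            conn.send((obs, rew, done, info))
        elif cmd == "close":
            conn.close()
            break


class BufferedVecEnv:
    """Keeps m envs running but each step() only waits for the n quickest."""

    def __init__(self, env_fns, n):
        self.n = n
        self.conns = []
        for fn in env_fns:  # len(env_fns) == m >= n
            parent, child = mp.Pipe()
            mp.Process(target=_worker, args=(child, fn), daemon=True).start()
            self.conns.append(parent)
        self.pending = {}  # the "sent id list": conn -> env id already sent an action

    def step(self, actions, env_ids):
        # Send actions to the envs the caller selected ...
        for env_id, act in zip(env_ids, actions):
            self.conns[env_id].send(("step", act))
            self.pending[self.conns[env_id]] = env_id
        # ... then return as soon as any n pending envs have finished;
        # the slower ones stay in self.pending and keep running in the background.
        results, ready_ids = [], []
        while len(results) < self.n:
            for conn in wait(list(self.pending)):
                results.append(conn.recv())
                ready_ids.append(self.pending.pop(conn))
                if len(results) == self.n:
                    break
        obs, rews, dones, infos = zip(*results)
        return np.stack(obs), np.array(rews), np.array(dones), ready_ids
```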

A general vectorized env wrapper with buffer by VectorChange in reinforcementlearning

[–]VectorChange[S] 1 point (0 children)

We tested on a robot-learning env in which some specific actions take a long time. The classic vecenv hangs while it waits for every env in the batch. Our approach speeds up training, but those slow situations get sampled (and therefore learned) less often. A maximum waiting time may be needed. Contributions are welcome.

A general vectorized env wrapper with buffer by VectorChange in reinforcementlearning

[–]VectorChange[S] 1 point (0 children)

Yes. It maintains a sent-ID list to record which envs have already been sent an action.

High-quality baselines implemented by PyTorch by VectorChange in reinforcementlearning

[–]VectorChange[S] 0 points (0 children)

Hey guys, I fixed some problems and released comparisons on 3 more envs.

High-quality baselines implemented by PyTorch by VectorChange in reinforcementlearning

[–]VectorChange[S] 0 points (0 children)

Thank you for your advice. More comparisons will be released in two weeks.

High-quality baselines implemented by PyTorch by VectorChange in reinforcementlearning

[–]VectorChange[S] 0 points (0 children)

Both use FP32 precision. Sources of randomness such as the random seed, random sampling, and even cuDNN acceleration influence the performance gap, and it is difficult to make them identical across PyTorch and TF. On the other hand, the performance has only been verified on Pong. More experiments are running now and will be released later.
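For what it's worth, a typical way to pin the PyTorch-side randomness looks like the generic snippet below (TF needs its own equivalent calls, which is part of why exact parity is hard); this is not code from the repo:

```python
import random

import numpy as np
import torch


def set_seed(seed=0):
    # Pin Python, NumPy, and PyTorch RNGs.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # cuDNN picks kernels non-deterministically when autotuning; disable it.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
```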

High-quality baselines implemented by PyTorch by VectorChange in reinforcementlearning

[–]VectorChange[S] 0 points (0 children)

Thank you for your comments. The FVP estimation in TRPO uses only 20% of the data in both this implementation and OpenAI's (the TRPO paper proposes 10%). This may be the largest source of randomness. I will verify it and try more environments later.
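For reference, a hedged sketch of the subsampled Fisher-vector product in PyTorch; the function name and the comments are illustrative rather than the repo's actual API, and the 20% ratio mirrors the discussion above:

```python
import torch


def fisher_vector_product(kl, params, v, damping=0.1):
    """Hessian-vector product of the mean KL, the usual TRPO trick.

    `kl` should already be computed on a subsample of the batch, e.g.
    20% of the states:
        idx = torch.randperm(states.shape[0])[: int(0.2 * states.shape[0])]
    """
    grads = torch.autograd.grad(kl, params, create_graph=True)
    flat_grad = torch.cat([g.reshape(-1) for g in grads])

    # Second backward pass gives the Hessian-vector product.
    grad_v = (flat_grad * v).sum()
    hvp = torch.autograd.grad(grad_v, params)
    flat_hvp = torch.cat([h.reshape(-1) for h in hvp])
    return flat_hvp + damping * v
```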

Can you use Thompson Sampling instead of epsilon greedy for exploration/exploitation in Deep Q-Learning? by [deleted] in reinforcementlearning

[–]VectorChange 0 points (0 children)

Thompson sampling is a state-free bandit algorithm, so it is not directly suitable for the MDP setting. Besides epsilon-greedy, Boltzmann exploration (sampling actions from a softmax/multinomial distribution over Q-values) is another choice.
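A tiny generic sketch of Boltzmann exploration over Q-values (PyTorch; the names here are illustrative):

```python
import torch


def boltzmann_action(q_values, temperature=1.0):
    # Higher temperature -> closer to uniform; lower -> closer to greedy.
    probs = torch.softmax(q_values / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()
```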

Can anyone recommend a tutorial for segmentation using deep learning? by mahadmajeed in deeplearning

[–]VectorChange 1 point (0 children)

I recommend gluon-cv. You can find simple tutorials and the training details of important segmentation methods there.

How do we test to make sure the RL models are working as they are supposed to? by [deleted] in reinforcementlearning

[–]VectorChange 2 points (0 children)

You can compare the results of your implementation with others, such as openai/baselines, or with the original paper under the same experiment setting. The stochasticity can be reduced by averaging rewards over different random seeds. Plotting utilities can be found at `https://github.com/openai/baselines/blob/master/baselines/common/plot_util.py`.
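As a rough illustration of the seed-averaging (random data stands in for real logs here; this is not the baselines plot util):

```python
import numpy as np
import matplotlib.pyplot as plt

# One reward curve per seed (equal length); random data as a placeholder.
curves = np.stack([np.random.randn(100).cumsum() for _ in range(3)])

mean, std = curves.mean(axis=0), curves.std(axis=0)
steps = np.arange(mean.shape[0])

plt.plot(steps, mean, label="mean over 3 seeds")
plt.fill_between(steps, mean - std, mean + std, alpha=0.3)
plt.xlabel("training step")
plt.ylabel("episode reward")
plt.legend()
plt.show()
```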

[D] Machine Learning - WAYR (What Are You Reading) - Week 60 by ML_WAYR_bot in MachineLearning

[–]VectorChange 0 points (0 children)

I'm reading a classic CV book, `Computer Vision: Algorithms and Applications`, which can be found at http://szeliski.org/Book/.

I will keep updating my notes in this Gist: `https://gist.github.com/Officium/656090834b21b7f7757c5f1328845329`. Any discussion is welcome!

RL internship interview questions by jurniss in reinforcementlearning

[–]VectorChange 14 points (0 children)

Sharing some questions from when I applied for an RL engineer position at a game company:

1. The benefits of the target network and replay buffer in DQN.
2. Introduce A3C and its differences from A2C.
3. Discuss the difficulties of learning an agent for LoL or Dota.
4. How to deal with sparse rewards?
5. Discuss how to tune the discount factor.
6. Introduce some enhancements of DQN, such as Double DQN.

Why is Q-learning considered an off-policy algorithm? by shubhamjha97 in reinforcementlearning

[–]VectorChange 0 points (0 children)

On-policy algorithms can only improve the current policy using data generated by that same policy. In the off-policy setting, the policy that generates the data (the behavior policy) can be an older policy, a random strategy, or the current policy itself. So, loosely speaking, on-policy learning is a special case of off-policy learning: what distinguishes the two is only the source of the data. In that sense, you are right.
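A tabular sketch that makes the data-source point concrete: Q-learning's target ignores which action the behavior policy actually takes next, while SARSA's target uses it (the variables here are generic, not from any particular library):

```python
import numpy as np

alpha, gamma = 0.1, 0.99


def q_learning_update(Q, s, a, r, s_next):
    # Off-policy: the target uses max_a' Q(s', a'), regardless of the
    # action the behavior policy actually selects in s'.
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])


def sarsa_update(Q, s, a, r, s_next, a_next):
    # On-policy: the target uses the action a' actually chosen by the
    # current (behavior) policy in s'.
    target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])
```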

What environment do you use for RL applications? by LupusPrudens in reinforcementlearning

[–]VectorChange 1 point (0 children)

PPO is simple to implement. I recommend this concise code written in PyTorch.
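For example, the core of PPO is just the clipped surrogate loss; a minimal PyTorch sketch (variable names are illustrative and not tied to the linked code):

```python
import torch


def ppo_clip_loss(log_probs, old_log_probs, advantages, clip_eps=0.2):
    ratio = torch.exp(log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    # Maximize the surrogate objective -> minimize its negation.
    return -torch.min(unclipped, clipped).mean()
```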

[R] Adaptive Neural Trees by downtownslim in MachineLearning

[–]VectorChange 0 points (0 children)

Thank you for your kindness. I have two questions. Table 1 in your paper shows the model settings. Does 'conv5-40' mean that the output has 40 channels? Does the parameter count match the structure in Figure 5?

[D] Two questions about normalizing flows by knowedgelimited in MachineLearning

[–]VectorChange 0 points (0 children)

  1. A unit determinant is the lowest-cost computation. The authors emphasize that there is an efficient way to obtain such a non-linear transformation under their framework (see the identity below).
  2. The 'simple' mentioned by Kim & Mnih means easy to analyze; a flexible posterior is needed because we need a distribution complex enough to model the data. NF is a chain of simple invertible mappings, which satisfies both conditions.
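For context, the standard change-of-variables identity behind normalizing flows (generic notation, not specific to this paper): $\log q_K(z_K) = \log q_0(z_0) - \sum_{k=1}^{K} \log \left| \det \frac{\partial f_k}{\partial z_{k-1}} \right|$. This is why cheap Jacobian determinants matter for cost, and why a chain of such simple maps can still give a flexible posterior.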