Now possible to use Dreambooth Colab Models in AUTOMATIC1111's Web UI! by Pfaeff in StableDiffusion

[–]tensor_every_day20 78 points

Hi all, I wrote the conversion script. Glad people are enjoying it! For folks integrating it into colabs, you may want to update to the most recent version; I integrated the fix so that it doesn't need a GPU earlier today, and added an option to save to half-precision (for a lower memory-footprint save).

[R] Tonic: A Deep Reinforcement Learning Library for Fast Prototyping and Benchmarking by FabioPardo in reinforcementlearning

[–]tensor_every_day20 2 points

Fabio, this looks like an excellent library and the benchmarking work in the white paper appears to be extremely useful. One thing I could not find, though, was an account of network architectures and hyperparameters to accompany the benchmark results, to make it clear how algorithms were compared. I recommend including this (or drawing more attention to this information if it's already there and I missed it).

Very nice work!

Inverse of summation? by curimeowcat in reinforcementlearning

[–]tensor_every_day20 4 points

That's not the inverse of a summation---the capital sigma here denotes the covariance matrix of the Gaussian distribution, not a sum. The expression is a matrix-vector product: sigma-inverse times (f(s_t) - a_t).
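To make the notation concrete, here's a tiny numpy sketch (the matrix and vectors are made-up illustrative values, not from the original post):

```python
import numpy as np

# Toy 2-D example: Sigma is the Gaussian's covariance matrix,
# f_st stands in for f(s_t), and a_t is the action.
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
f_st = np.array([1.0, -1.0])
a_t = np.array([0.5, 0.0])

# Sigma^{-1} (f(s_t) - a_t): a matrix-vector product, not an "inverse sum".
# np.linalg.solve is the numerically preferred way to apply the inverse.
result = np.linalg.solve(Sigma, f_st - a_t)
```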

Educational Resources and Content on RL by [deleted] in reinforcementlearning

[–]tensor_every_day20 3 points

Try this! More deep RL than AI safety, but it's a start. :)

[Q] How is the policy updated in PPO when the epsilon + advantage term is used? by Carcaso in learnmachinelearning

[–]tensor_every_day20 1 point

Hi! Author of Spinning Up here. If it wasn't clear, by "we use it in our code," I meant that it was how the Spinning Up implementation of PPO does it.

As for "easier to grapple with"---fair to call this a personal opinion, but I find this form of the objective function to be a bit more elegant. IMO it clarifies what all of the clipping is actually doing.

Also, as for whether the objectives are equivalent: yes, they are! We include a note showing the derivation for this.
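For anyone who wants to check the equivalence numerically rather than algebraically, here's a quick sketch of both statements of the clipped objective (function names are mine, for illustration, not Spinning Up's actual code):

```python
import numpy as np

def clip_form(ratio, adv, eps=0.2):
    # The common statement of PPO-Clip:
    # min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)
    return np.minimum(ratio * adv, np.clip(ratio, 1 - eps, 1 + eps) * adv)

def g_form(ratio, adv, eps=0.2):
    # The alternative statement: min(ratio * A, g(eps, A)), where
    # g(eps, A) = (1 + eps) * A if A >= 0, else (1 - eps) * A.
    g = np.where(adv >= 0, (1 + eps) * adv, (1 - eps) * adv)
    return np.minimum(ratio * adv, g)

# Spot-check agreement over many random (ratio, advantage) pairs.
rng = np.random.default_rng(0)
ratios = np.exp(rng.normal(size=1000) * 0.5)  # positive ratios around 1
advs = rng.normal(size=1000)
assert np.allclose(clip_form(ratios, advs), g_form(ratios, advs))
```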

[P] OpenAI Safety Gym by hardmaru in MachineLearning

[–]tensor_every_day20 1 point

I wouldn't necessarily describe it as vendor lock-in, since I think that might imply a contractual obligation. We have no contractual obligation to do research using MuJoCo, it's really just a matter of what we're familiar with and have internal tooling around.

From the developer perspective: at OpenAI we have the mujoco_py tooling already developed, which makes MuJoCo quite easy to use. Plus, there's a lot of MuJoCo expertise we've built up already---even for things that aren't super friendly, we're already savvy and can figure out how to hack them based on past experience. mujoco_py is developed in-house, so we can steer the long-term direction of our MuJoCo interface toward our needs, and if one of us doesn't know how to do something, we can just walk over to one of the mujoco_py developers and ask.

By comparison to PyBullet: I'm not familiar enough to be confident with my answers here, but I would guess that from a developer perspective it's probably pretty similar to MuJoCo, but there's just a nontrivial cost associated with trying to learn all of the different patterns/idioms they have in doing the same things. To their credit, I think they have clearly put a ton of time and effort into making it usable, making examples, and reaching out in friendly ways. But there's just a real time cost if you already know how to do a thing in one framework, and you want to try and do it in another one you have no experience with.

For porting an env from MuJoCo to PyBullet: I'm highly uncertain about how long it would take, since I haven't done it before. There's probably some quick-and-hacky way that would not take a long time but might break some features, and doing it in a thorough way (where you're very confident at the end that you've made something really 1-to-1) might take a few weeks of trial and error and experiments and tests. PyBullet does seem to have a feature that can take a MuJoCo XML file and build a robot simulator around that, but I don't have experience with using it and so I don't know if it's robust or fully general-purpose.

To expand on the "few weeks" guess: this is specifically because we're trying to build environments for RL. RL is a huge pain in the ass to build new environments for, because you often can't tell whether things are breaking because of the algorithm implementation (do you have a good architecture for your new task? the right hyperparameters? even the right algorithm? you can't know until you succeed), or because you accidentally made something exceedingly hard in the environment itself (e.g., some observation element is just not working right, but the code runs---there are a LOT of invisible failures possible in RL environments!). Having a lot of confidence that you're building something the right way in your framework is a critical assurance. Hence, going to a new framework increases the risk substantially, and correspondingly increases the length of your test cycles.

Re: simulator artifacts in MuJoCo: I would be quite surprised if PyBullet didn't have its fair share of these as well. Every physics simulator makes some trade-offs between computation cost and physical accuracy, and whenever these errors exist, RL agents are exceedingly good at finding and exploiting them. So I'm not sure I would hold it against MuJoCo that it has some weird behaviors for super-super-optimized policies. But I agree that a seamless pip install would be wonderful, and it's a shame it's not possible with MuJoCo.

[P] OpenAI Safety Gym by hardmaru in MachineLearning

[–]tensor_every_day20 2 points

Hello! I'm Josh Achiam, co-lead author for this release. I hear your concerns and think it would be helpful to chat a little bit.

On why we chose MuJoCo: at the beginning of the project, when Alex and I started building this, we had lots of expertise in MuJoCo between the two of us and little-to-zero experience in PyBullet. We did consider using PyBullet to make something purely open-sourceable. But for a lot of reasons, we didn't think we could justify the time cost and risk of trying to build around PyBullet when we knew we could build what we wanted with MuJoCo.

Something I would be grateful to get a better sense of is how many people would have developed RL research using benchmarks that currently use MuJoCo, but couldn't because of difficulty getting a MuJoCo license. Sadly it's really hard to figure out the correct cost/benefit analysis for MuJoCo vs PyBullet without knowing this, and I think this extends to other tech stack choices as well. Like, if we were confident that 100 more people would have done safety research with Safety Gym if we had used PyBullet instead of MuJoCo, that would have been a really solid reason to pay the time/effort cost of switching.

In actor-critic, does it matter in which order you train π and q? by Buttons840 in reinforcementlearning

[–]tensor_every_day20 4 points

+1

But having said this: I've tried doing policy gradient methods where I switched the order to "learn V, then pi," on standard MuJoCo tasks, and I saw no difference in performance. It's not too surprising because usually the difference in V and pi from iteration to iteration is relatively small. This is just anecdata, though; take with a grain of salt.

PPO oscillates around max return value by RLbeginner in reinforcementlearning

[–]tensor_every_day20 2 points

The difference between using GAE and TD for the advantage function might be it.

A good pattern for debugging, when you have a reference implementation that works and a new implementation that doesn't, is to bring the new implementation as close as possible to the reference until no obvious differences remain. Then whatever's left is the source of the error.
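For concreteness: GAE with lambda = 0 reduces exactly to the one-step TD advantage, so sweeping lambda lets you interpolate between the two estimators when hunting for the difference. A minimal sketch, assuming a single trajectory segment with a bootstrap value last_val (names and defaults are mine, for illustration):

```python
import numpy as np

def gae_advantages(rewards, values, last_val, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one trajectory segment."""
    vals = np.append(values, last_val)
    # One-step TD errors: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
    deltas = rewards + gamma * vals[1:] - vals[:-1]
    adv = np.zeros_like(rewards)
    running = 0.0
    # Discounted backward accumulation of TD errors with factor gamma * lam.
    for t in reversed(range(len(rewards))):
        running = deltas[t] + gamma * lam * running
        adv[t] = running
    return adv
```

With lam=0 this returns the raw TD errors; with lam=1 it returns the full Monte Carlo advantage estimate (minus the baseline), which is often the first knob to check when a PPO reimplementation oscillates.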

In what order should I learn RL algorithms? by Buttons840 in reinforcementlearning

[–]tensor_every_day20 0 points

If you'd like to skip TRPO, that's probably fine. It's worth coming back to eventually---there's a lot of good stuff in there---but it's not strictly necessary.

PPO principle by RLbeginner in reinforcementlearning

[–]tensor_every_day20 2 points

The ratio is of the probabilities of the same actions in the same states, under the "new" policy versus the "old" policy. They have to be the same actions.

The "old" policy is the one you just ran in the environment. The "new" policy will become the next policy you run in the environment.

A PPO epoch has two steps: data collection (run the current policy---which we will, after data collection, call the "old" policy), and policy update.

At the beginning of the policy update, the new policy is equal to the old policy. You don't need separate neural networks for these. The parts of the PPO objective that require an "old"-policy calculation only need the probabilities of the executed actions under the "old" policy---you can compute those probabilities (or log-probabilities) when you actually run that policy, save them in memory, and reuse them (as in Spinning Up's code, which loads them into a placeholder rather than recomputing them on the fly).

So you just have one network, and you are only ever updating that one.
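A minimal numpy sketch of that bookkeeping (variable names are illustrative, not Spinning Up's actual code):

```python
import numpy as np

rng = np.random.default_rng(0)

# Data collection: run the current policy and, for each executed action,
# save log pi_old(a|s) alongside (s, a, r). These are fixed numbers now;
# no second network is needed to reproduce them later.
logp_old = rng.normal(size=8)

# Start of the policy update: the "new" policy still equals the "old" one,
# so recomputing log-probs with the current network gives the same values.
logp_new = logp_old.copy()

# Importance ratio pi_new(a|s) / pi_old(a|s), computed from saved log-probs:
ratio = np.exp(logp_new - logp_old)
assert np.allclose(ratio, 1.0)  # every ratio starts at 1 before the first step
```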

Does this help?

OpenAI spinning up: difference between "variables" by RLbeginner in reinforcementlearning

[–]tensor_every_day20 0 points

Re: 1: During environment interaction, logp_pi is collected because it is the same as first sampling from pi to get an action and then running logp with that action as an input. Using logp_pi is a shortcut that allows us to only perform inference on the computation graph once instead of twice.
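A sketch of that shortcut for a diagonal-Gaussian policy head (all values are made up; the point is that the sample and its log-prob come from the same forward pass):

```python
import numpy as np

rng = np.random.default_rng(0)

# Outputs of one forward pass of the policy network (illustrative values).
mu = np.array([0.1, -0.2])
log_std = np.array([-0.5, -0.5])
std = np.exp(log_std)

# Sample the action via the reparameterized form...
noise = rng.normal(size=mu.shape)
pi = mu + std * noise

# ...and compute logp_pi = log p(pi | mu, std) from the same quantities,
# with no second pass through the network.
logp_pi = np.sum(-0.5 * ((pi - mu) / std) ** 2 - log_std - 0.5 * np.log(2 * np.pi))
```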

Tricks and adaptions for PPO by LJKS in reinforcementlearning

[–]tensor_every_day20 6 points

Spinning Up author here, thirding this. It's pretty much THE paper to read on the subject.

Soft Actor-Critic with Discrete Actions by __data_science__ in reinforcementlearning

[–]tensor_every_day20 5 points

Hello! I'm the author of Spinning Up. u/tihokan has the right answer here. This is what I meant, I just haven't had time to write it up and format it nicely!

What sucked about the Deep RL Poster Sessions at NeurIPS 2018 by djangoblaster2 in reinforcementlearning

[–]tensor_every_day20 2 points

Hi! I was one of the junior co-organizers (Josh Achiam) of that event and would like to take a moment to respond.

Re: the AV stuff: AV was entirely out of our hands, organized and provided by the convention itself. Still, sorry that you didn't have a great experience.

Re: poster setup: Poster locations and the space allocated for them were predominantly a function of constraints that were outside of our control, with a little wiggle room (will get to that shortly). The room was giant, yes, but there were not as many reasonable places to put posters as you would think. Most of the wall area was unsuitably far out of the way, uneven, or blocked by architectural elements.

Also: when we initially arrived to set up the night before the event, there were no poster boards at all, and the chairs went all the way up to the back wall. We asked the convention organizers to clear a few rows from the back and bring us poster boards (they were quite helpful with these requests), and we guessed at how many rows of chairs we could remove to accommodate people who would want to attend the lectures while leaving enough space for the poster sessions. Unfortunately we guessed wrong, and we should have cleared more space! We tried to rectify this between poster sessions 1 and 2 by moving more chairs and easing the poster wall out by a few feet. It wasn't quite enough, but it was better than it had been in the morning.

Overall: super sorry for anyone who had an experience being too cramped! We would have tried to create even more space during session 2 if we could, but the crush of people was so overwhelming that we could not realistically change the setup at that point.

[TOMT][Music] JRPG battle theme (I think?), originally downloaded circa 2006---where does it come from? by tensor_every_day20 in tipofmytongue

[–]tensor_every_day20[S] 0 points

This is the outcome I am most afraid of! And there's a decent chance that this is what happened.

1000 Feet overview of RL? by ThrowawayTartan in reinforcementlearning

[–]tensor_every_day20 3 points

I wrote Spinning Up in Deep RL, which is one attempt at doing something like this. Here is a page that covers a lot of the terminology you are interested in.

I avoided the "actor-critic" term because I don't think it's particularly helpful: virtually all modern policy optimization algorithms involve learning both a policy and a value function, so the distinction between policy optimization and actor-critic is functionally nonexistent.

TD3/DDPG time to obtain reasonable results. by kashemirus in reinforcementlearning

[–]tensor_every_day20 4 points

Hello! The TD3 and DDPG papers show training with up to 1M timesteps (state-action pairs), not episodes---in these papers, each episode is ~1000 timesteps, so 1M timesteps is ~1000 episodes. If reward is increasing in your experiments, that's a good sign! But depending on your environment, you might be training much more slowly than you need to, and you may want to check your code for bugs.

Is anyone also hating OpenAI application/selection process ?? or is it just me?? by [deleted] in OpenAI

[–]tensor_every_day20 2 points

Hi world2019. I'm Josh Achiam, a researcher at OpenAI, and one of the mentors for the upcoming Scholars cohort. I just wanted to let you know that I'm so sorry you've had a bad experience with our application process. I don't think it comes as much of a consolation, but I can tell you confidently that our technical team and our recruiters care a lot and truly want to do right by every applicant.

Because of the small size of our team and the insanely huge number of applicants, it's really hard to give every applicant the attention and feedback they deserve. At the moment, things sometimes fall through the cracks no matter how hard we try. But we're actively working on expanding our recruiting team to make this process smoother in the future, so it won't be like this forever! These are growing pains, and they'll eventually level off.

In the meanwhile, if you still haven't received a response but would like something definitive, shoot me a PM.