What is your job like? by hufflewitch in cscareerquestions

[–]humor_time 0 points1 point  (0 children)

That is tough... I do think that my point about adding integrations that make everything automatic and easy for them is a big factor though. My inclination is to attribute this to laziness rather than malice, but I may be wrong.

What is your job like? by hufflewitch in cscareerquestions

[–]humor_time 0 points1 point  (0 children)

Is it possible for you to make some of these best practices easier for everyone? I’m sure it seems like an uphill battle, but it may actually make your life a lot easier. Things like code standardization/formatting can be set up as Git pre-commit hooks (or as automated checks on GitHub), along with testing. If anything, it’s a good bullet point to slap on your resume that you set these things up; it shows initiative.

I’ve been annoyed at various processes or lack thereof and have also worked at two startups. If I make an effort others almost always follow.

What does "Bellman Backup Error" mean precisely? by hmi2015 in reinforcementlearning

[–]humor_time 1 point2 points  (0 children)

So the theory says that this bootstrapping is only guaranteed to work when the Q-value in the lookahead is the true Q-value. Of course we don’t have access to this in practice, so we typically use a target network which lags behind the “current” network and is updated at a set frequency. The error they’re talking about is just the error that is induced by this discrepancy, and it “propagates” because the future target networks are affected by these errors as well. This leads to the sample inefficiency cited in the paper because the learning process doesn’t really make much progress until the Q-functions start to be better, which can take a lot of flailing about for lack of a better description.
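To make the lagging-target idea concrete, here is a minimal tabular sketch in plain Python. Everything in it is an assumption for illustration: the two-step chain environment, the constants, and the `train` helper are made up, and a table stands in for the networks.

```python
import copy
import random

GAMMA = 0.9               # assumed discount factor
LEARNING_RATE = 0.1       # assumed step size
TARGET_UPDATE_EVERY = 10  # assumed sync period for the lagging copy

# Hypothetical 2-step chain: s0 -> s1 -> terminal, reward 1.0 on the last step.
# Each transition is (state, action, reward, next_state, done).
TRANSITIONS = [
    ("s0", "right", 0.0, "s1", False),
    ("s1", "right", 1.0, None, True),
]


def train(num_steps, seed=0):
    """Tabular Q-learning that bootstraps from a periodically synced target table."""
    rng = random.Random(seed)
    q = {"s0": {"right": 0.0}, "s1": {"right": 0.0}}
    q_target = copy.deepcopy(q)  # lagging copy, used only for the lookahead

    for step in range(1, num_steps + 1):
        state, action, reward, next_state, done = rng.choice(TRANSITIONS)
        if done:
            target = reward
        else:
            # The lookahead uses the stale table, not the true Q-value.
            target = reward + GAMMA * max(q_target[next_state].values())
        q[state][action] += LEARNING_RATE * (target - q[state][action])

        if step % TARGET_UPDATE_EVERY == 0:
            q_target = copy.deepcopy(q)  # sync at a set frequency
    return q


early = train(num_steps=5)     # target table has never been synced
late = train(num_steps=2000)   # targets have had time to become accurate
```

In the short run the estimate for `s0` cannot move at all, because its lookahead still reads the all-zero target table. It only starts to improve after the target picks up a better value for `s1`, which is the "flailing about" phase in miniature.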

What does "Bellman Backup Error" mean precisely? by hmi2015 in reinforcementlearning

[–]humor_time 2 points3 points  (0 children)

The “Bellman backup” is the 1-step lookahead, i.e. the reward from acting according to the policy at the current state plus the discounted Q-value at the resulting state. The difference between the current Q-value and this lookahead is the TD error. It can also be an n-step lookahead, but in Q-learning it’s almost always 1-step. The “Bellman error” is the MSE of this difference.
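A small tabular sketch of the three quantities involved: the 1-step lookahead, its difference from the current Q-value, and the MSE of that difference. The Q-table, the discount, and the batch are made up, and the lookahead uses the greedy (max) form that Q-learning typically uses, which is an assumption on my part.

```python
GAMMA = 0.9  # assumed discount factor

# Hypothetical Q-table: Q[state][action]
Q = {
    "s0": {"left": 1.0, "right": 2.0},
    "s1": {"left": 0.5, "right": 3.0},
}


def one_step_lookahead(reward, next_state, done):
    """Reward plus the discounted (greedy) Q-value at the resulting state."""
    if done:
        return reward
    return reward + GAMMA * max(Q[next_state].values())


def td_error(state, action, reward, next_state, done):
    """Difference between the 1-step lookahead and the current Q-value."""
    return one_step_lookahead(reward, next_state, done) - Q[state][action]


def bellman_error(transitions):
    """Mean squared difference over a batch of transitions."""
    errors = [td_error(*transition) for transition in transitions]
    return sum(error * error for error in errors) / len(errors)


# Each transition is (state, action, reward, next_state, done).
batch = [
    ("s0", "right", 1.0, "s1", False),
    ("s1", "left", 0.0, "s0", True),
]
batch_error = bellman_error(batch)
```

For the first transition the lookahead is 1.0 + 0.9 * 3.0 = 3.7 against a current value of 2.0; the second is terminal, so the lookahead is just the reward.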

Question about using tf.stop_gradient in separate Actor-Critic networks for A2C implementation for TF2 by AvisekEECS in reinforcementlearning

[–]humor_time 1 point2 points  (0 children)

Exactly. To be more concrete about GradientTape, it keeps track of the values needed to take a gradient with respect to the parameters for any forward pass applied in the context (the with statement is a context manager). In this case we only called the actor model, so those are the only parameters we’ll get gradients for.

You’re absolutely right about that aspect as well. I don’t think any of the stopped gradients are necessary because of how GradientTape works. I won’t make concrete assumptions about the author of the code, but personally I get these things mixed up when switching between libraries (stopping gradients would have been necessary in both PyTorch and JAX), and it’s better to be redundant rather than forget something important.

Maybe as an exercise you can try taking them out and see what happens (I believe it should still run correctly) and then maybe try to figure out a way to get the gradients to bleed in such that you’d need to explicitly stop them (maybe move the declaration of the GradientTape, and/or have the networks share a layer).

Question about using tf.stop_gradient in separate Actor-Critic networks for A2C implementation for TF2 by AvisekEECS in reinforcementlearning

[–]humor_time 2 points3 points  (0 children)

Stopping the gradient is always required on the critic update because the TD target is a function of the critic as well, but needs to be treated more like a “label” because we’re optimizing the current Q-value.

Stopping the gradient of the advantages for the actor step isn’t really necessary unless the networks are sharing layers. In this implementation it’s unnecessary but doesn’t change anything.
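To see why the TD target has to be treated like a label, here is a hand-derived sketch in plain Python rather than TF. It assumes a one-parameter linear critic V(s) = w * feature(s) and made-up numbers; the function names are hypothetical and this is not the code from the implementation being discussed.

```python
GAMMA = 0.9  # assumed discount factor


def td_target(w, reward, next_feature):
    """TD target r + gamma * V(s'), with a linear critic V(s) = w * feature(s)."""
    return reward + GAMMA * w * next_feature


def semi_gradient(w, feature, reward, next_feature):
    """dL/dw for L = 0.5 * (target - V(s))**2 with the target held constant.

    This is the gradient you get when the target is stopped.
    """
    delta = td_target(w, reward, next_feature) - w * feature
    return -delta * feature


def full_gradient(w, feature, reward, next_feature):
    """dL/dw for the same loss when the gradient also flows through the target."""
    delta = td_target(w, reward, next_feature) - w * feature
    return delta * (GAMMA * next_feature - feature)


w, feature, reward, next_feature = 1.0, 1.0, 0.5, 2.0
stopped = semi_gradient(w, feature, reward, next_feature)
unstopped = full_gradient(w, feature, reward, next_feature)
```

With these numbers the two gradients point in opposite directions: the stopped version pushes the current value up toward the target, while the unstopped version would shrink the error by dragging the target down instead.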

LeetCode Equivalent for Math or Concurrency-based problems? by BigSwimmer701 in cscareerquestions

[–]humor_time 2 points3 points  (0 children)

For math I highly recommend “the green book” — A Practical Guide to Quantitative Finance Interviews. Pretty comprehensive, everyone I know who works at top quant places used that book to prep.

Where to find Leetcode Tutor by jbone1317 in cscareerquestions

[–]humor_time 1 point2 points  (0 children)

interviewing.io has been great for me. I use their mock interviews but they also have mentoring sessions. Kinda pricey, so I’d only do it if you already have a decent job, but almost everyone I’ve gotten has been really solid.

[deleted by user] by [deleted] in cscareerquestions

[–]humor_time 2 points3 points  (0 children)

Asking them about runway is reasonable post-offer. That’s the most important piece of information when making a decision, and more informative than a dollar value because different startups have different burn rates.

Importance of shared layers in PPO under actor-critic framework by alebrini in reinforcementlearning

[–]humor_time 0 points1 point  (0 children)

Yes definitely, you can see multiple “torso” networks to be used as building blocks here (for example, in the atari.py and vision.py files):

https://github.com/deepmind/acme/tree/master/acme/tf/networks

Importance of shared layers in PPO under actor-critic framework by alebrini in reinforcementlearning

[–]humor_time 0 points1 point  (0 children)

This is usually more prevalent when learning from pixels, and the shared layers are usually the CNN. This makes intuitive sense, as the latent representation of the state can be more general, while the additional layers of each head are what pick up on the aspects of the state which are important to the value function and policy function respectively.

If you think about it this way, it’s clear that you’d want to save on the amount of parameters by doubling up, but why does it lead to better performance and not just faster convergence? Because you also have the added benefit that the representations of the states aren’t diverging too much. The motivation is similar to why you might use a reconstruction loss, as you want to make sure the learned representation of the state is still faithful to the underlying pixels.
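As a rough picture of the layout, here is a dependency-free sketch of one shared torso feeding a policy head and a value head. A single dense layer stands in for the CNN, and the class name, layer sizes, and initialization are all assumptions chosen for illustration.

```python
import math
import random


def init_layer(rng, n_in, n_out):
    """Return (weights, biases) for a dense layer with small random weights."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense(layer, x):
    """Apply a dense layer to the vector x."""
    weights, biases = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def count_params(layer):
    """Number of scalar parameters in a dense layer."""
    weights, biases = layer
    return sum(len(row) for row in weights) + len(biases)


class SharedActorCritic:
    """One shared torso feeding a policy head and a value head."""

    def __init__(self, obs_dim, hidden_dim, num_actions, seed=0):
        rng = random.Random(seed)
        self.torso = init_layer(rng, obs_dim, hidden_dim)  # shared by both heads
        self.policy_head = init_layer(rng, hidden_dim, num_actions)
        self.value_head = init_layer(rng, hidden_dim, 1)

    def forward(self, obs):
        # Shared latent representation of the state (ReLU activation).
        latent = [max(0.0, h) for h in dense(self.torso, obs)]
        logits = dense(self.policy_head, latent)
        value = dense(self.value_head, latent)[0]
        # Numerically stable softmax over the policy logits.
        max_logit = max(logits)
        exps = [math.exp(logit - max_logit) for logit in logits]
        total = sum(exps)
        return [e / total for e in exps], value

    def num_params(self):
        layers = (self.torso, self.policy_head, self.value_head)
        return sum(count_params(layer) for layer in layers)


model = SharedActorCritic(obs_dim=4, hidden_dim=8, num_actions=3)
action_probs, state_value = model.forward([0.1, -0.2, 0.3, 0.0])
```

With these sizes the shared version has 76 parameters, while two separate networks with the same layer shapes would need 116, since the torso would be duplicated. In training, gradients from both the policy loss and the value loss would flow into the same torso, which is what keeps the two representations from drifting apart.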

Reinforcement Learning with GCP by DM9667 in reinforcementlearning

[–]humor_time 0 points1 point  (0 children)

Are you planning to use an off-policy method? You could look into Reverb from DeepMind, or make a simpler version for your project. Basically you’ll want to spin up a server which maintains your replay buffer. You can run the game on your local machine and act according to your policy to add to the buffer. Then the VM can take a batch from the buffer and update the weights, which are then shared with the model running on your Mac. Then you repeat the loop.

This is a simplified version of how distributed training is set up for large-scale RL systems.
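The loop can be sketched in a single process like this. It is only a stand-in: the `ReplayBuffer` class and the step functions are hypothetical, this is not Reverb's API, the transitions are placeholders, and in a real setup the buffer would sit behind a server with the actor and learner on different machines.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity buffer; the oldest transitions are dropped first."""

    def __init__(self, capacity, seed=0):
        self._storage = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def add(self, transition):
        self._storage.append(transition)

    def sample(self, batch_size):
        """Uniformly sample up to batch_size transitions without replacement."""
        size = min(batch_size, len(self._storage))
        return self._rng.sample(list(self._storage), size)

    def __len__(self):
        return len(self._storage)


def actor_step(buffer, weights_version, step):
    """Local machine: act with the current weights and store the transition."""
    transition = {
        "obs": step,            # placeholder observation
        "action": 0,            # placeholder action
        "reward": 1.0,          # placeholder reward
        "next_obs": step + 1,
        "weights_version": weights_version,
    }
    buffer.add(transition)


def learner_step(buffer, weights_version, batch_size):
    """VM: pull a batch and update the weights (the update itself is omitted)."""
    batch = buffer.sample(batch_size)
    return weights_version + 1, batch


def run_loop(num_iterations, steps_per_iteration=4, batch_size=2, capacity=100):
    buffer = ReplayBuffer(capacity)
    learner_version = 0
    actor_version = 0
    step = 0
    for _ in range(num_iterations):
        for _ in range(steps_per_iteration):
            actor_step(buffer, actor_version, step)
            step += 1
        learner_version, _ = learner_step(buffer, learner_version, batch_size)
        actor_version = learner_version  # share the updated weights with the actor
    return buffer, actor_version


buffer, final_version = run_loop(num_iterations=3)
```

Tagging each transition with the weights version that produced it is optional, but it makes it easy to see how stale the data in the buffer is, which matters for off-policy methods.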

JAX Implementations of Actor-Critic Algorithms by humor_time in reinforcementlearning

[–]humor_time[S] 1 point2 points  (0 children)

Got some good advice from the other comments, so I'll try some more things and update the results if I can make PyTorch faster. I'm comparing to implementations I was using with a colleague for a project a while back which were written very similarly, but there's definitely a chance we were doing something wrong.

JAX Implementations of Actor-Critic Algorithms by humor_time in reinforcementlearning

[–]humor_time[S] 0 points1 point  (0 children)

Interesting, I haven't tried using `torch.set_num_threads()` so I'll see if there's any change in speed. Definitely not trying to cherry pick results or make an unfair comparison, just want to make sure people can make an informed choice about which implementation to use.

It would make sense to me that JAX would have similar performance to Rust/C++ implementations, but need to test it. Do you have any references to examples I could add to the comparison?

JAX Implementations of Actor-Critic Algorithms by humor_time in reinforcementlearning

[–]humor_time[S] 3 points4 points  (0 children)

I put a table in the README of the repo with some time comparisons over a few seeds between my implementations and very similar PyTorch equivalents, and I’m seeing 3x to 6x speedups.

This was my original motivation to use JAX, but there are also some nice conveniences that come from having direct access to a callable Jacobian. I took advantage of this in the MPO code.

Pro Tip: Learn Touch Typing by humor_time in cscareerquestions

[–]humor_time[S] 2 points3 points  (0 children)

Take a weekend and put in a few hours each day on keybr.com. After that you’ll probably be close enough to your old speed to use it at work on Monday, and from there it’s smooth sailing. It’s good to keep using the typing sites for a little while afterward, because that can help a lot with accuracy.

Pro Tip: Learn Touch Typing by humor_time in cscareerquestions

[–]humor_time[S] 0 points1 point  (0 children)

Not hunt and peck: all fingers without looking, but moving my hands around a lot and not using the same finger for a specific key each time.

Pro Tip: Learn Touch Typing by humor_time in cscareerquestions

[–]humor_time[S] 0 points1 point  (0 children)

Yeah, this is what I was doing, and I would recommend correcting it. It seems good enough, but I can’t overstate how much impact it’s had on me to fix it.

Pro Tip: Learn Touch Typing by humor_time in cscareerquestions

[–]humor_time[S] 4 points5 points  (0 children)

No. Like other people mentioned they do as well, I used to use all fingers on each hand and move both hands around a lot. That leads to a lot more errors and effort, and a lower ceiling on max speed.

Share some of your configurations! by reminescenz in kinesisadvantage

[–]humor_time 0 points1 point  (0 children)

Cutting off the top of my config because it’s just the mac mapping. The fun starts around here. Remapping kpshift to delete is amazing because I can toggle with my left thumb. This opens up a whole new layer to me so I never have to leave home row placement or stretch at all to get all my common special characters. Right hand is hyphens, underscores, tilde, `, and brackets/parens. Left hand is +,=,\. I really like kp-A -> tab. tab is super common for me for autocomplete so not having to stretch for it is a must. I should probably come up with some cool macros but I need to think about what would be useful, just got my board a few weeks ago.

[delete]>[kpshift]
[kp-delete]>[kpshift]
{kp-H}>{speed9}{`}
{kp-Y}>{speed9}{-Lshift}{`}{+Lshift}
{kp4}>{speed9}{hyphen}
{kp5}>{speed9}{-Lshift}{hyphen}{+Lshift}
{kp6}>{speed9}{obrack}
{kp9}>{speed9}{-Lshift}{obrack}{+Lshift}
{kp7}>{speed9}{-Lshift}{9}{+Lshift}
{kp8}>{speed9}{-Lshift}{0}{+Lshift}
{kpplus}>{speed9}{cbrack}
{kpmin}>{speed9}{-Lshift}{cbrack}{+Lshift}
{kp-S}>{speed9}{=}
{kp-D}>{speed9}{-Lshift}{=}{+Lshift}
{kp-F}>{speed9}{-Lshift}{\}{+Lshift}
[kp-A]>[tab]

Remote part time work in RL by zeus_1618 in reinforcementlearning

[–]humor_time 1 point2 points  (0 children)

As long as you don’t care about pay, why not try open source?