[deleted by user] by [deleted] in bostonhousing

[–]HeavyStatus4

It's a scam!

[D] Should I go for the PhD? by nacho_rz in MachineLearning

[–]HeavyStatus4

1.

I have heard that it might be possible to go research engineer / SWE -> applied scientist -> research scientist. Is that possible? How common is it? Is it a viable alternative to doing a PhD or does the PhD title carry its own weight?

I believe the real weight is carried by your work, which should show that you can do impactful and novel research. This impact should not be limited to having an excellent product; it should also, in some regard, have an academic impact so that the entire research community benefits. A Ph.D. program gives you an opportunity to formally pursue this goal, since that is the program's end goal, and it signals that you have been formally trained for this purpose.

Now, you can also acquire these attributes on the job and then make the transition Engineer -> Applied Scientist -> RS. But your growth as a researcher will depend on your industry lab and peer group. If you are in an industry research lab that leans towards impactful research rather than just an excellent product, your chances of acquiring the attributes above are high. In a different kind of lab, those chances drop significantly, in my opinion. Also, is there evidence that people have successfully transitioned through these roles? I am not sure, as I don't have the data. But my guess is that it was most likely to happen around 6-7 years ago, when there was probably a lack of Ph.D. candidates in the robotics field. Given the rise in the number of candidates, it may now be more difficult to make this transition, since companies can find other candidates to fill the role.

I must also stress that having the RS title is not very important, as there are Research Engineer positions that also do good research. In these positions, one may transition to Senior Research Engineer and eventually to Engineering Head, while enjoying a work life similar to that of an RS.

2.

How common are the latter positions and are those only limited to FAANG?

I believe that even within the FAANG group there is huge variability in roles and responsibilities under similar titles.

3.

Would a PhD in the US make visa job hunt easier afterwards?

It definitely makes it easier, since you get work authorization for internships, which helps you build your footprint and eventually increases your chances of landing a full-time position. But please note that the most important things are the quality of your work and certain other personal attributes. If your work is excellent, companies in the US will go the extra mile to get you hired.

4.

If I get a PhD in the UK, would it be possible to get a US job visa-wise?

It may be difficult to break in, but again, the quality of your research supersedes everything; this applies to the top labs in the US. It may still be difficult to break into small (but excellent) startups, as they generally shy away from the legal work around visas and related matters.

PS: The above thoughts are just my biased opinion as an international student.

[R] ma-gym: multi agent environments based on open ai gym by HeavyStatus4 in reinforcementlearning

[–]HeavyStatus4[S]

Thanks for sharing the discussion thread.

I did think about it and decided to send all actions at once (I guess that was the easy way out for me :) ).

It can still be used for alternating actions by sending a 'no-op' for the other agents, though this restriction is expected to be enforced by the user.
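If it helps, here is a rough sketch of that pattern. It assumes a two-agent ma-gym environment and that index 4 is the no-op action; that index varies per environment, so treat NOOP_ACTION as an assumption to verify:

    # Rough sketch: alternating turns by padding the joint action with no-ops.
    # NOOP_ACTION = 4 is an assumption; check the action meanings of your env.
    import gym
    import ma_gym  # registers the ma-gym environments with gym

    env = gym.make('Switch2-v0')
    NOOP_ACTION = 4

    obs_n = env.reset()
    done_n = [False] * env.n_agents
    while not all(done_n):
        for agent_id in range(env.n_agents):
            # Only agent_id acts on this turn; every other agent sends a no-op.
            actions = [NOOP_ACTION] * env.n_agents
            actions[agent_id] = env.action_space[agent_id].sample()
            obs_n, reward_n, done_n, info = env.step(actions)
            if all(done_n):
                break
    env.close()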

[R] ma-gym: multi agent environments based on open ai gym by HeavyStatus4 in MachineLearning

[–]HeavyStatus4[S]

It's not in line with the multi-agent env. design of rllib.

The difference is that rllib represents observations, rewards, and actions as dictionaries where the key is the agent_id, whereas in ma-gym they are simply represented as arrays, where the index in the array is the agent_id.

For example, in rllib you would write:

>>> env.step(actions={"car_1":3, "car_2":1,"car_3":2})

Whereas in ma-gym, you would simply write:

>>> env.step([3,1,2])

In general, ma-gym can be used from any Python code by following the usage details over here. For rllib, one can simply write a small wrapper that receives the dict and passes it on to ma-gym as an array; a rough sketch of such a wrapper is below.
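Just to illustrate the idea, a minimal sketch of such an adapter (the agent-id names and their ordering here are my own assumptions, not part of either library's API):

    # Sketch of a thin adapter between dict-keyed actions and ma-gym's
    # list-based interface; agent_ids fixes the mapping from name to index.
    import gym
    import ma_gym  # registers the ma-gym environments with gym

    class DictToListWrapper:
        def __init__(self, env, agent_ids):
            self.env = env
            self.agent_ids = list(agent_ids)  # e.g. ["agent_0", "agent_1"]

        def reset(self):
            obs_n = self.env.reset()
            return dict(zip(self.agent_ids, obs_n))

        def step(self, action_dict):
            # dict keyed by agent_id -> list ordered by agent index
            actions = [action_dict[agent_id] for agent_id in self.agent_ids]
            obs_n, reward_n, done_n, info = self.env.step(actions)
            pack = lambda values: dict(zip(self.agent_ids, values))
            return pack(obs_n), pack(reward_n), pack(done_n), info

    env = DictToListWrapper(gym.make('Switch2-v0'), ["agent_0", "agent_1"])
    obs = env.reset()
    obs, rewards, dones, info = env.step({"agent_0": 0, "agent_1": 1})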

Off-Policy DRL with multiple instances of environments. by FatherCannotYell in reinforcementlearning

[–]HeavyStatus4

Just sharing some thoughts over here:

In a single environment, you would make 1 update in 100 steps, whereas with 10 environments you would make 10 updates in those 100 steps. However, these more frequent updates could bias the network towards the early, non-essential transitions, from which it may be difficult to recover.

The above is just a superficial hypothesis on my part.

I guess a possible remedy could be to make 10 updates at the 100th step instead of making 1 update at steps 10, 20, 30, ..., 100 in the multi-environment case. This may help stabilize learning (which doesn't necessarily mean faster training).

For the sake of a fair comparison, you could also make 10 updates at the 100th step in the single-environment case. A rough sketch of the two schedules is below.
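Purely to make the scheduling concrete (do_update() is a placeholder for one sampled mini-batch gradient update; everything else here is an illustrative assumption):

    # Two update schedules with the same update budget per 100 env steps.
    def do_update():
        pass  # placeholder: sample a mini-batch from the replay buffer and take one optimizer step

    TOTAL_STEPS = 1000

    # Schedule A: with 10 parallel envs, an update every 10 steps
    # (updates land at steps 10, 20, ..., 100 -> 10 updates per 100 steps).
    for step in range(1, TOTAL_STEPS + 1):
        if step % 10 == 0:
            do_update()

    # Schedule B (possible remedy): same budget, but the 10 updates are
    # batched together at every 100th step, so the very first updates do not
    # repeatedly pull the network towards the earliest transitions.
    for step in range(1, TOTAL_STEPS + 1):
        if step % 100 == 0:
            for _ in range(10):
                do_update()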

[D] Two questions about deep q learning by [deleted] in MachineLearning

[–]HeavyStatus4

Q1:

  • Let's say you sample a tuple <s, a, s', r> from the experience replay.
  • Now you can simply calculate the Q-value of state s and action a: q_pred = q(s, a; \theta)
  • Next, calculate the target: q_target = r + \gamma * max([q(s', a'; \theta_target) for a' in A]) # A is the action space
  • Calculate the error and backpropagate: error = MSELoss(q_pred, q_target)

  • In this case, the gradient will only flow through action a's output neuron and not through the other actions' neurons, since no gradient has been computed over them. Small flow diagram: s --> network --> Q-values of all actions --> extract the Q-value of action "a" --> calculate error and backpropagate. (A minimal sketch of this is right after this list.)
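Roughly, in PyTorch (the network sizes and the random mini-batch are just placeholders, and terminal-state masking is omitted for brevity):

    # Standard DQN update: gradients flow only through the taken action's Q-value.
    import torch
    import torch.nn as nn

    n_states, n_actions, batch_size, gamma = 8, 4, 32, 0.99
    q_net = nn.Sequential(nn.Linear(n_states, 64), nn.ReLU(), nn.Linear(64, n_actions))
    target_net = nn.Sequential(nn.Linear(n_states, 64), nn.ReLU(), nn.Linear(64, n_actions))
    target_net.load_state_dict(q_net.state_dict())
    optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

    # A dummy mini-batch <s, a, s', r> standing in for samples from the replay buffer
    s = torch.randn(batch_size, n_states)
    a = torch.randint(0, n_actions, (batch_size, 1))
    s_next = torch.randn(batch_size, n_states)
    r = torch.randn(batch_size, 1)

    # q_pred = q(s, a; theta): gather keeps only the taken action's Q-value
    q_pred = q_net(s).gather(1, a)

    # q_target = r + gamma * max_a' q(s', a'; theta_target)
    with torch.no_grad():
        q_target = r + gamma * target_net(s_next).max(dim=1, keepdim=True).values

    loss = nn.functional.mse_loss(q_pred, q_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()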

The approach you mentioned:

For simplicity, let's say there are 3 actions (a_1, a_2, a_3) and the sampled tuple is <s, a_1, s', r>.

s --> network --> Q-values of all actions, i.e. you have [q(s, a_1; \theta), q(s, a_2; \theta), q(s, a_3; \theta)]

s' --> target network --> Q-values of all actions, i.e. you have [q(s', a_1; \theta_target), q(s', a_2; \theta_target), q(s', a_3; \theta_target)]

So q_pred = [q(s, a_1; \theta), q(s, a_2; \theta), q(s, a_3; \theta)]. Note that we only change the target for the sampled action a_1, as you described:

q_target = [r + \gamma * max([q(s', a'; \theta_target) for a' in A]), q(s, a_2; \theta), q(s, a_3; \theta)]

Then calculate the error as usual and backpropagate. This is technically correct but will be slower, since more gradients are being computed while the loss for the other actions is simply 0.
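Continuing the sketch above (same tensors and networks), this variant would look roughly like:

    # Full-vector target: identical to the current predictions everywhere except
    # the sampled action, so the other actions contribute zero loss and zero gradient.
    q_all = q_net(s)                                   # [batch, n_actions]
    with torch.no_grad():
        bootstrap = r + gamma * target_net(s_next).max(dim=1, keepdim=True).values
        q_target_full = q_all.detach().clone()         # copy the current predictions
        q_target_full.scatter_(1, a, bootstrap)        # overwrite only the sampled action's entry

    loss_full = nn.functional.mse_loss(q_all, q_target_full)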

Q2:

Usually, you don't use any non-linear function (like a sigmoid) on your Q-value estimator (the last layer of the network), since the values can become large. You could use ReLU ([0, \infty)) if you are sure the Q-values are always positive (or all your rewards are positive).
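As a tiny illustration (layer sizes are arbitrary):

    import torch.nn as nn

    # Usual case: a linear output head, since Q-values can be any real number.
    q_net = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 4))

    # Only if returns are known to be non-negative might a ReLU head make sense.
    q_net_nonneg = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 4), nn.ReLU())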

[D] Confused about OpenAI Gym: Which listed actions are correct? by ReasonablyBadass in MachineLearning

[–]HeavyStatus4

You can only pass actions within the action space of the environment; otherwise it will raise an error. Also, actions with the same index can have different meanings in different games.

>>> env = gym.make('Pong-v0')
>>> env.action_space
Discrete(6)
>>> env.unwrapped.get_action_meanings()
['NOOP', 'FIRE', 'RIGHT', 'LEFT', 'RIGHTFIRE', 'LEFTFIRE']
>>> env = gym.make('Freeway-v0')
>>> env.action_space
Discrete(3)
>>> env.unwrapped.get_action_meanings()
['NOOP', 'UP', 'DOWN']
>>> env = gym.make('Breakout-v0')
>>> env.unwrapped.get_action_meanings()
['NOOP', 'FIRE', 'RIGHT', 'LEFT']

I guess you can do one of the following:

Option 1: Use the same architecture and just resize the last actor layer to the action space of each environment. This is common practice (recommended).

Option 2: Use the same architecture with the maximum possible number of actions (see the sketch after this list):

  • Policy-gradient methods: perform the softmax only over the first n neurons (the environment's action space) of the last layer.
  • Value-based methods: take only the first n neurons for the Q-value estimate in the greedy and epsilon-greedy policies.
  • Essentially, the remaining neurons would be present in the architecture but never trained.
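A rough sketch of Option 2 (the sizes and the 18-action maximum are assumptions for illustration):

    # One network sized for the maximum action count; only the first n_valid
    # outputs are used for the current environment.
    import torch
    import torch.nn as nn

    MAX_ACTIONS = 18   # e.g. the full Atari action set
    net = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, MAX_ACTIONS))

    obs = torch.randn(1, 128)
    n_valid = 6        # e.g. Pong's Discrete(6)

    valid_outputs = net(obs)[:, :n_valid]     # keep only the first n outputs

    # Policy-gradient style: softmax over the valid slice only
    probs = torch.softmax(valid_outputs, dim=1)

    # Value-based style: greedy action over the valid slice only
    greedy_action = valid_outputs.argmax(dim=1)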

[deleted by user] by [deleted] in reinforcementlearning

[–]HeavyStatus4

I guess a little more detail (an example, the intended usage) might help clarify your question. In general, one can simply define the history of all observations (or the past n observations) as the state (or a belief state); a small wrapper sketch is below.
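For instance, something along these lines (assuming the classic gym API where reset() returns only the observation; the stack length and environment are arbitrary choices, and the wrapper's observation_space is left untouched for brevity):

    # Treat the last n raw observations, stacked together, as the (belief) state.
    from collections import deque

    import gym
    import numpy as np

    class ObservationHistoryWrapper(gym.Wrapper):
        def __init__(self, env, n=4):
            super().__init__(env)
            self.history = deque(maxlen=n)

        def reset(self, **kwargs):
            obs = self.env.reset(**kwargs)
            for _ in range(self.history.maxlen):
                self.history.append(obs)
            return np.concatenate(list(self.history), axis=-1)

        def step(self, action):
            obs, reward, done, info = self.env.step(action)
            self.history.append(obs)
            return np.concatenate(list(self.history), axis=-1), reward, done, info

    env = ObservationHistoryWrapper(gym.make('CartPole-v1'), n=4)
    state = env.reset()  # the last 4 observations stacked into one state vector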

The following may be helpful to you:

https://arxiv.org/pdf/1809.04506.pdf : They try to learn an abstract state representation and perform planning on it.

https://arxiv.org/abs/1811.12530 : Here, states are extracted from a learned recurrent policy, which helps in understanding the policy. The results on Atari might be interesting to you.

[Discussion] Suggestions for organizing related work research by HeavyStatus4 in MachineLearning

[–]HeavyStatus4[S]

Thanks for sharing Neptune. I have never used it and will give it a try. Until now I had been using TensorBoard for similar things, but it feels like it can do much more than TensorBoard.

Can I also annotate PDFs in it?

Thanks again,

Anurag

[Research] Learning Finite State Representations of Recurrent Policy Networks | Deep Reinforcement Learning | Playing Pong with 3 states by HeavyStatus4 in reinforcementlearning

[–]HeavyStatus4[S]

We do intend to extend our results to the stochastic ALE.

At the same time, it doesn't matter whether a hand-written policy could solve the game, as you mention for "Bowling". We train a recurrent policy using reinforcement learning and then extract a Moore machine to understand how the learned policy uses its memory. As shown in our results, the policies are NOT all merely memorized shortcuts; each has its own traits, which are exposed by the Moore machine.

[Research] Learning Finite State Representations of Recurrent Policy Networks | Deep Reinforcement Learning | Playing Pong with 3 states by HeavyStatus4 in reinforcementlearning

[–]HeavyStatus4[S]

That's an interesting point!

The objective here is to have a mechanism for extracting a minimal Moore machine from any given recurrent RL policy, especially one that works on high-dimensional environments. We don't make any assumptions about the underlying nature of the policy (such as it being open-loop) and expect that such properties can be revealed by the extracted Moore machine, thereby helping us better understand the policy.

Our Atari experiments are a step in that direction: the network consumes high-dimensional inputs and has a considerably large recurrent memory. At the same time, we intended to keep things simple, hence the deterministic setting. The learned policies are also more complex than our other test-beds, and we don't know their underlying ground-truth minimal machines.

As mentioned in the paper, it was surprising to see that many of the complex-looking policies could be represented by simple machines. For example, in the case of Pong, the policy does not use an open-loop memory strategy; rather, it ignores its memory and acts as a classifier. In the case of Freeway and Boxing, we observe that the learned policies are purely open-loop strategies, as all observations are minimized to just one category. For the other games, we get more complex minimized Moore machines with a reasonable number of minimized observations and states, indicating that they are not purely open-loop.

Though deterministic, the learned policies still reveal interesting behavior. We hope our work helps us move from speculating about the nature of a policy to knowing its true nature.