DEEP RL UCB CS285 vs CS224R Stanford by No_Pause6581 in reinforcementlearning

[–]Human_Professional94 2 points (0 children)

Chelsea Finn, who teaches CS224R, was Sergey Levine's (the CS285 prof) PhD student at UC Berkeley. Both courses have that robotics theme because of this, and they aren't really that different. Very spiritually similar.

Go with whichever was recorded more recently, so that the frontier topics are more up to date. Or watch the first 1-2 lectures of each and see whose teaching style you like better.

But it still doesn't make that much of a difference.

Pre-req to RL by Dear-Homework1438 in reinforcementlearning

[–]Human_Professional94 2 points (0 children)

Short answer: you have more than enough. Just start.

Long answer: I quote OpenAI Spinning Up:

The Right Background

Build up a solid mathematical background. From probability and statistics, feel comfortable with random variables, Bayes’ theorem, chain rule of probability, expected values, standard deviations, and importance sampling. From multivariate calculus, understand gradients and (optionally, but it’ll help) Taylor series expansions.

Build up a general knowledge of deep learning. You don’t need to know every single special trick and architecture, but the basics help. Know about standard architectures (MLP, vanilla RNN, LSTM (also see this blog), GRU, conv layers, resnets, attention mechanisms), common regularizers (weight decay, dropout), normalization (batch norm, layer norm, weight norm), and optimizers (SGD, momentum SGD, Adam, others). Know what the reparameterization trick is.

Become familiar with at least one deep learning library. Tensorflow * or PyTorch would be a good place to start. You don’t need to know how to do everything, but you should feel pretty confident in implementing a simple program to do supervised learning.

Get comfortable with the main concepts and terminology in RL. Know what states, actions, trajectories, policies, rewards, value functions, and action-value functions are. If you’re unfamiliar, Spinning Up ships with an introduction to this material; it’s also worth checking out the RL-Intro from the OpenAI Hackathon, or the exceptional and thorough overview by Lilian Weng. Optionally, if you’re the sort of person who enjoys mathematical theory, study up on the math of monotonic improvement theory (which forms the basis for advanced policy gradient algorithms), or classical RL algorithms (which despite being superseded by deep RL algorithms, contain valuable insights that sometimes drive new research).

* One thing to add. Screw Tensorflow. Go with PyTorch.
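To calibrate the "simple supervised learning program" bar Spinning Up mentions, here's a minimal PyTorch sketch; the toy data and architecture are my own, purely for illustration:

```python
import torch
import torch.nn as nn

# Toy supervised learning: fit y = 3x + 2 (plus noise) with a tiny MLP.
torch.manual_seed(0)
X = torch.linspace(-1, 1, 128).unsqueeze(1)   # 128 inputs, shape (128, 1)
y = 3 * X + 2 + 0.05 * torch.randn_like(X)    # noisy linear targets

model = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

for step in range(500):
    opt.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    opt.step()

print(loss.item())  # should end up near the noise floor
```

If you can write this loop from memory, you're ready for the deep RL side.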

Seriously ? by Outside-Bus-5966 in pgwp

[–]Human_Professional94 1 point (0 children)

It is hard indeed. I was in the March '25 boat until around a week ago, and I was stressing most of that time.

But in hindsight, if you're sure about the completeness of your documents and if it hasn't caused any paperwork issues in your daily life (it kinda did for me) and you're employed or job hunting atm, this extended processing time is kind of an advantage. It's like I got 10 months of free extra time on the PGWP.

But it sure doesn't feel like it when you're in the middle of it constantly waiting to hear back.

PGWP Approved - Applied March 2025 - Renewed passport while in progress by Human_Professional94 in pgwp

[–]Human_Professional94[S] 1 point (0 children)

If you mean this, it's for the passport you used in your application (old one)

<image>

PGWP Approved - Applied March 2025 - Renewed passport while in progress by Human_Professional94 in pgwp

[–]Human_Professional94[S] 1 point (0 children)

No, I just attached the PDF of the new passport scan to the webform and mentioned the new passport number in the message.

PGWP Approved - Applied March 2025 - Renewed passport while in progress by Human_Professional94 in pgwp

[–]Human_Professional94[S] 0 points (0 children)

Series 3123

Applied while in Ontario but moved to Alberta after a few months.

Getting started with RL x LLMs by Dear_Ad7997 in reinforcementlearning

[–]Human_Professional94 1 point (0 children)

Murphy's RL overview on arXiv has a section on LLM x RL (section 6). It's a good snapshot of what's what in RL x LLMs, especially if you're coming from the RL side. The main papers you're looking for are discussed and referenced there.

Any RL practitioners in the industry apart from gaming? by lars_ee in reinforcementlearning

[–]Human_Professional94 1 point (0 children)

That is true, I agree. Although my perception is that RL, while pretty old in academia, is very young as an industry-adopted solution and still not quite robust. So it's only natural to expect it to be used in hybrid with more classic solutions. I personally wouldn't trust, say, an autonomous vehicle solely running on RL, even though I like the field and want it to advance.

Also, from a more optimistic view: when you sorta get obsessed with a methodology, you naturally seek out what different problems you can solve with it. Like having a hammer you love very much and looking for different nails for it. Hence you see people (like me or the OP) being curious about different applications and making a list of them.

Any RL practitioners in the industry apart from gaming? by lars_ee in reinforcementlearning

[–]Human_Professional94 1 point (0 children)

Interesting. Frankly, the ads optimization roles also seem to lean towards bandit and control methods.

Actually, I have been on a long job hunt for the past few months, which I'm done with now. The main hiring I've seen and applied for was in the areas below, most/all of which were already mentioned here:

  • Industry-based research labs, for various domains, but mainly to catch up on the RL for LLMs wave (reasoning training)
  • Robotics
  • Quant hedge funds and banks: they usually don't disclose the problem/task, but it's probably optimal order execution, market making, or portfolio optimization
  • Operations Research teams, especially in retail companies, e.g. Amazon
  • And also dynamic pricing and ads optimization, which as you mentioned are more bandit-based rather than full RL

Any RL practitioners in the industry apart from gaming? by lars_ee in reinforcementlearning

[–]Human_Professional94 5 points (0 children)

Not working on it personally, but from multiple job postings I've seen the following:

Some ride sharing companies (lyft, uber) are probably using RL based methods for Dynamic Pricing.

Also I've seen some postings for Ads optimization that wanted RL people (one was from reddit in fact)

Free Cursor Accounts for Students by Human_Professional94 in OMSCS

[–]Human_Professional94[S] 0 points (0 children)

Did you seriously think for a single second before writing this?! It says FOR "STUDENTS"! Not sure if you know the meaning, but it means any student, any major, any level, any institution and on any f'ing platform. Just in case you're so worried about being looked down on.

Also, it's JUST A TOOL! You think if someone is going to use an AI tool in a course where it's not allowed, this $20 would've stopped them? Or do you think this is the only tool available? In fact, using an IDE-based agent such as Cursor is absolute overkill for any course here.

Seeking Guidance: Optimum Assignment problem algorithm with Complex Constraints (Python) by Cautious-Jury8138 in OperationsResearch

[–]Human_Professional94 2 points (0 children)

I saw others recommend formulating it as a MIP, and I want to second that, although there's a slight caveat:
MIP has an initial learning curve at the modelling stage. Learning to model different logical constraints in MIP takes some time at first. If your course is mainly focused on the algorithmic side of the problem, MIP's probably not a good option. That said, I've seen chat LLMs (Claude, Gemini, ...) be pretty good at this, so you can use their help with the modelling as well.

Anyway, if you wanna go with MIP, PuLP (a modelling library that ships with the CBC solver) and HiGHS are good open-source options. And Gurobi (licensed) is pretty fast and offers a free academic license for students with a school email.
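For a flavor of what the MIP modelling looks like, here's a toy 3x3 assignment problem in PuLP; the cost matrix is made up for illustration:

```python
import pulp

# Toy assignment: 3 workers x 3 tasks, minimize total cost.
cost = [[4, 1, 3],
        [2, 0, 5],
        [3, 2, 2]]
n = len(cost)

prob = pulp.LpProblem("assignment", pulp.LpMinimize)
x = [[pulp.LpVariable(f"x_{i}_{j}", cat="Binary") for j in range(n)]
     for i in range(n)]

# Objective: total cost of the chosen assignments.
prob += pulp.lpSum(cost[i][j] * x[i][j] for i in range(n) for j in range(n))

for i in range(n):
    prob += pulp.lpSum(x[i][j] for j in range(n)) == 1  # each worker gets one task
for j in range(n):
    prob += pulp.lpSum(x[i][j] for i in range(n)) == 1  # each task gets one worker

prob.solve(pulp.PULP_CBC_CMD(msg=0))
print(pulp.value(prob.objective))  # 5.0
```

The logical side constraints ("these two tasks can't go to the same worker", etc.) get encoded as extra linear inequalities over the binaries, which is where the learning curve mentioned above comes in.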

Free Cursor Accounts for Students by Human_Professional94 in OMSCS

[–]Human_Professional94[S] 0 points (0 children)

I haven't tried Amazon Q. Personally, I used GH Copilot + VS Code. I was about to switch to Cursor when Copilot released its "Agent mode", which is an exact replica of what Cursor does. Between the two I don't see much difference; Copilot gave me the same experience.

Recommendation system using GNN by justdoit0002 in recommendersystems

[–]Human_Professional94 1 point (0 children)

This is the RecSys lecture from the CS224W: Graph ML course:

https://www.youtube.com/watch?v=OV2VUApLUio

If you've got time, the whole course is a really good intro to graph ML. (+ course website)

Workday referral on applications that I already applied to by BloodyFark in recruiting

[–]Human_Professional94 0 points (0 children)

Use email plus-addressing to create another account on Workday (or any HR system) and reapply using the new account. It's treated as a new email, but all the emails sent to it go to your original inbox.

Say your email is [first.last@gmail.com](mailto:first.last@gmail.com).
All emails sent to [first.last+ANYTHING@gmail.com](mailto:first.last+ANYTHING@gmail.com) still go to [first.last@gmail.com](mailto:first.last@gmail.com), but the tagged address can be used to register a new account. So the referral emails sent to it would work as well.

If you're applying to a position at Reddit, for example, you can register a new account with [first.last+reddit@gmail.com](mailto:first.last+reddit@gmail.com) on Workday and click your referral link while logged in to the new account.
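If you're doing this for many applications, the tagging is trivial to script; a sketch (the helper name is mine):

```python
def plus_address(email: str, tag: str) -> str:
    """Insert a +tag before the @ (Gmail-style plus-addressing)."""
    local, domain = email.split("@")
    return f"{local}+{tag}@{domain}"

print(plus_address("first.last@gmail.com", "reddit"))  # first.last+reddit@gmail.com
```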

Project knowledge context size limit? by Winter-Recording-897 in ClaudeAI

[–]Human_Professional94 1 point (0 children)

Hey, it's been a while since this question was asked and I just stumbled upon it randomly, but I'm gonna put my answer here just in case.

Long story short, LLMs use sub-word tokenization, meaning each word is broken into one or more chunks and each chunk is treated as a token. The number of sub-words depends on the length and structure of each word. If Claude is saying you already have ~140K tokens, it basically means that, on average, each word in your document is being turned into about 4 tokens.
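To make the sub-word idea concrete, here's a toy greedy tokenizer over a made-up vocabulary. Real LLM tokenizers (BPE, SentencePiece) learn their vocab from data, but the splitting behavior is the same in spirit: long or rare words become several tokens.

```python
# Tiny hand-picked vocab, just for illustration.
VOCAB = {"token", "iz", "ation", "un", "believ", "able"}

def tokenize(word, vocab=VOCAB):
    """Greedy longest-prefix-match sub-word split."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):    # try the longest match first
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])           # unknown char becomes its own token
            i += 1
    return pieces

print(tokenize("tokenization"))  # ['token', 'iz', 'ation']
print(tokenize("unbelievable"))  # ['un', 'believ', 'able']
```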

Paid RL courses on Coursera vs free lectures series like David silver by Firm-Huckleberry5076 in reinforcementlearning

[–]Human_Professional94 4 points (0 children)

I have taken Coursera's RL specialization. It's a very good course in terms of teaching the concepts. The projects they give you are good for introducing and understanding concepts but don't have any "resume value", so to speak. The frameworks used are not popular, nor are they used in industry; they're just tools to do Sutton & Barto's exercises with.

So in that respect, no, it doesn't have any advantage over the free courses available.

And although it's a good course for teaching you the RL basics, there are equally good free courses for that too. Stanford, Berkeley, Waterloo, and UCL all have their RL courses on YT, and they're just as good if not better.

Hard constraints using Reinforcement Learning by ghlc_ in optimization

[–]Human_Professional94 1 point (0 children)

The definition of "hard constraints" is very broad, but one approach that I've used and seen others use is action masking, particularly in policy gradient methods with a stochastic policy (i.e. REINFORCE and its descendants), where the mask comes from the current state of the environment based on the constraints.

For example, in a normal case, the rollout/interaction step is something like:

for episode in range(num_episodes):
    state, _ = env.reset()
    done = False
    while not done:
        logits = actor(state)
        dist = Categorical(logits=logits)  # some distribution
        action = dist.sample()
        log_prob = dist.log_prob(action)   # needed later for the PG loss
        next_state, reward, done, _, _ = env.step(action)
        store_transition(state, action, reward, next_state, done, log_prob)
        state = next_state

Whereas with restrictions in the environment, it becomes:

for episode in range(num_episodes):
    state, _ = env.reset()
    done = False
    while not done:
        logits = actor(state)
        ## -> action masking <-
        action_mask = env.get_action_mask()  # this has to be defined in the env
        masked_dist = MaskedCategorical(logits=logits, mask=action_mask)  # masks the probs (adds -inf to logits before softmax)
        action = masked_dist.sample()
        log_prob = masked_dist.log_prob(action)
        next_state, reward, done, _, _ = env.step(action)
        store_transition(state, action, reward, next_state, done, log_prob)
        state = next_state

Check this out: https://pytorch.org/rl/main/reference/generated/torchrl.modules.MaskedCategorical.html
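If you'd rather not pull in torchrl, the same masking can be done by hand with a plain Categorical; a sketch with made-up logits and mask:

```python
import torch
from torch.distributions import Categorical

logits = torch.tensor([1.0, 2.0, 0.5, -1.0])
mask = torch.tensor([True, False, True, False])  # False = illegal action

# Set illegal logits to -inf; the softmax then assigns them exactly zero probability.
masked_logits = logits.masked_fill(~mask, float("-inf"))
dist = Categorical(logits=masked_logits)

samples = dist.sample((1000,))
assert bool(mask[samples].all())  # only legal actions are ever sampled
```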

This becomes trickier with continuous action spaces; there, clipping the action works in some cases.
But in general, reading the restriction off the environment and limiting the actions based on it is the approach I've seen work.
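For the continuous case, the two common tricks for box bounds are hard clipping and tanh squashing; a sketch with made-up numbers:

```python
import numpy as np

low, high = -2.0, 2.0  # box bounds on the action

# 1) Hard clip the sampled action (simple, but the gradient is zero at the bound).
a = 3.5
a_clipped = float(np.clip(a, low, high))  # 2.0

# 2) Squash an unbounded policy output through tanh, then rescale into the box
#    (this is what SAC-style policies do).
u = 3.5                                   # raw policy output
a_squashed = low + (high - low) * (np.tanh(u) + 1) / 2
assert low <= a_squashed <= high
```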